Load balancing Tomcat on a 64-bit OS - performance

I have a web application and need load balancing across Tomcat instances.
My hardware is 5 machines, each with 32 GB RAM, a quad-core processor and a 64-bit OS.
Should I have
5 Tomcats, one per machine, with -Xmx around 30 GB and a higher maxThreads, or
a larger number of Tomcats with a lower -Xmx, e.g. 25 Tomcats, 5 per machine, with -Xmx around 6 GB and the default maxThreads?
I have a load balancer using mod_proxy_balancer in front of the Tomcats. To simplify the situation, let's assume there is no bottleneck at the database layer.
Let me know if you need more info.
Thanks.

I see no benefit to having multiple Tomcats per machine, unless they happened to be serving different things and you wanted to take them down independently. Or if you anticipate crashes and you want to make sure at least some instances on a machine are isolated. Other than that, there's no apparent benefit. Certainly not from a performance standpoint, which is what I think you're asking.
There's a big downside to multiple Tomcats, which is that each has a fixed amount of memory. If your app happens to need a larger amount for a short period of time, you'll get Out of Memory errors. There's a greater margin of error when you have one big memory space.
Also, you'll find monitoring to be a bigger hassle with a larger number of instances.
If you've got multiple Tomcats, then they all have to run on different ports, or you've got to front them with something on each machine. In either case, it leads to unnecessary complexity.
I vote for one Tomcat per machine.
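For the one-Tomcat-per-machine setup, here is a minimal sketch of the two pieces involved; the hostnames, context path, port and maxThreads value are placeholders to adjust for your own load:

# httpd.conf on the load balancer (mod_proxy + mod_proxy_balancer)
<Proxy balancer://mycluster>
    BalancerMember http://tomcat1:8080
    BalancerMember http://tomcat2:8080
    BalancerMember http://tomcat3:8080
    BalancerMember http://tomcat4:8080
    BalancerMember http://tomcat5:8080
    ProxySet lbmethod=byrequests
</Proxy>
ProxyPass        /myapp balancer://mycluster/myapp
ProxyPassReverse /myapp balancer://mycluster/myapp

<!-- conf/server.xml on each Tomcat: a single instance per machine,
     with maxThreads raised above the default of 200 -->
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="600" connectionTimeout="20000" />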

Related

Spring Boot High Heap Usage

We have a spring boot application that runs on 20 servers and we have a balancer that redirects the requests to our servers.
Since last week we have been having huge problems with CPU usage (100% on all VMs), almost simultaneously, without any noticeable increase in the incoming requests.
Before that, we had 8 VMs without any issue.
In peak hours we have 3-4 thousand users with 15-20k requests per 15 minutes.
I am sure it has something to do with heap usage, since all the CPU usage comes from the GC trying to free up some memory.
At the moment, we have isolated some requests that we thought might be causing the problem to specific VMs, and those VMs have proven stable even though there is traffic. (In the picture below you can see the old-gen memory is stable in those VMs.)
The heap memory looked something like this:
The memory keeps increasing, and then one of two things happens: either it reaches a peak and, after 10 minutes, drops on its own to around 500 MB, or it stays there, causes 100% CPU usage, and stays there forever.
From the heap dumps that we have taken, it seems that most of the memory has been allocated in char[] instances.
We have used a variety of tools (VisualVM, JProfiler etc) in order to try to debug the issue without any luck.
I don't know if I am missing something obvious here.
I also tried changing the GC algorithm from the default to G1, and disabling the Hibernate query plan cache, since a lot of our queries use the IN parameter for filtering.
UPDATE
We managed to reduce the number of requests to our most heavily used API call, and the old gen looks like that now. Is that normal?
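For reference, the two mitigations mentioned above (switching to G1 and reining in Hibernate's query plan cache) are usually applied along these lines. The property names are standard JVM/Hibernate 5.2+ settings, but the values are only illustrative and need tuning for the actual workload:

# JVM flag (e.g. appended to JAVA_OPTS) to switch from the default collector to G1
-XX:+UseG1GC

# application.properties: pad IN clauses and shrink the query plan cache,
# a common source of char[] retention when many distinct IN (...) clauses
# are generated
spring.jpa.properties.hibernate.query.in_clause_parameter_padding=true
spring.jpa.properties.hibernate.query.plan_cache_max_size=512
spring.jpa.properties.hibernate.query.plan_parameter_metadata_max_size=32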

Sudden performance degradation running the same program in multiple instances on AMD ThreadRipper

Situation is:
One and the same native, standalone (MFC, SDI) application is started in multiple instances on a 32-core AMD ThreadRipper 3970X, 4400 MHz, "Castle Peak" CPU
Each instance performs its main workload (neural network computations) in a worker thread and adds only minimal CPU load in mainly two other threads (scripting and the GUI/UI thread)
All instances are performing the same type of work based on the same application script which is run in all the instances simultaneously. Work share is based on different input files the instances process.
The instances do not produce any graphics workload; in fact, they all run minimized
The instances do not communicate with each other
Each instance calculates its performance (in terms of processed work per time) continuously.
The instances do not access the hard disk in any significant way, just performance reporting every 10 s to a few lines of a text file.
Now here comes the problem:
When more than ~10 instances are started, the performance of each individual instance suddenly drops massively, so that starting more instances no longer increases the overall computation performance
The overall CPU load doesn't increase any more; the individual cores are used less and less the more instances are running
Below 10 instances everything scales up quite well
On machines with less cores (tried on several machines with 2, 4 and 8 cores) everything scales nicely, too, and the CPU is finally 100 % busy when enough instances are running
Each instance allocates about 3.2 MB of RAM when started and approx. 8.5 MB when it has loaded its files and is busy
--> Any idea what could be the limiting factor here or what I could try to find out?
My main thoughts are around the processor cache. The ThreadRipper has 128 MB of L3 cache shared by all cores. Once the cache is eaten up by ~10 instances, this could slow down the performance since the cores must then access external RAM, right?
On the other hand, the machines with fewer cores than the ThreadRipper are able to perform the same amount of work per core as the ThreadRipper does, although they have significantly less cache per core and even one instance would eat up all of their L3 cache.
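One way to narrow this down could be to pin groups of instances to disjoint sets of cores and compare the per-instance throughput that is reported anyway. A rough sketch with the affinity mask of the Windows start command; worker.exe and the input files are placeholders, and the masks assume the usual mapping of two logical processors per physical core:

rem first instance on logical processors 0-7 (the first four physical cores)
start /affinity FF worker.exe input1.dat

rem second instance on logical processors 8-15 (the next four cores)
start /affinity FF00 worker.exe input2.dat

If instances confined to separate core complexes keep their throughput while instances packed onto the same complex slow each other down, that would point at L3 contention rather than a global limit.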
Additional question: the application is statically linked to MFC. Would it be beneficial for cache usage if I linked it dynamically instead? I already tried, but am still getting many linker errors since multiple other libs are included, too.
Many thanks for all hints/ideas.

How to remove overhead caused by increasing number of CPU cores

We have a situation where a server with 8 cores initializes a service in 6 minutes, while another server with 32 CPU cores takes 35 minutes to initialize the same service. After profiling it properly, we have seen that it is because of two kernel APIs (get_counters and snmp_fold_field) which try to collect data from each existing core, and as the number of CPU cores increases, execution time takes longer than expected. In order to reduce initialization time we thought of disabling the extra cores and enabling all CPU cores after initialization. But in this approach too, once we enable all cores, synchronization happens on the newly enabled cores, as this is an SMP kernel.
Can someone suggest how to efficiently reduce the overhead caused by the increased number of CPU cores?
Instead of code, I would rather explain the initialization behaviour of this user-defined system service. During its initialization this service plumbs virtual interfaces on the configured IPs. To avoid overlapping-IP situations, for each configured IP it creates an internal IP, and all communication is done on the interfaces plumbed on the internal IP. As a packet reaches the system with a configured IP as its destination, mangling/NAT/routing table rules are applied on the system to handle it.
An interface is also plumbed for the configured IP to avoid IP forwarding. Our issue is that when we scale our system to 1024 configured IPs, it takes 8 minutes on the 8-core machine and 35 minutes on the 32-core machine.
On further debugging with a system profiler, we saw that the arptables/iptables kernel module consumes most of its time in get_counters() and the IP kernel module consumes its time in snmp_fold_field(). If I simply disable the arptables mangling rules, the time drops from 35 minutes to 18 minutes. I can share the kernel modules' call stacks collected with the profiler.
A common bottleneck in applications utilizing multiple cores is uncontrolled/unoptimized access to shared resources. The biggest performance killer is the bus lock (critical sections), followed by accesses to RAM, to the L3 cache and to the L2 cache. As long as a core is operating inside its own L1 (code and data) caches, the only bottlenecks will be due to poorly written code and badly organized data. When two cores need to go outside their L1 caches they may collide when accessing their (shared) L2 cache. Who gets to go first? Is the data even there, or must we move down into L3 or RAM to access it? For every further level required to find the data you're looking at a 3-4x time penalty.
Concerning your application it seems to have worked reasonably well because eight cores can only collide to a certain degree when accessing the shared resources. With thirty-two cores the problem becomes more than four times as large because you'll have four times the strain on the resource and four times as many cores to resolve who gets to access first. It's like a doorway: as long as a few people run in or out through the doorway now and then there's no problem. Someone approaching the doorway might stop a little to let the other guy through. Suddenly you have a situation where many people are running in and out and you'll not only have congestion, you'll have a traffic jam.
My advice to you is to experiment with restricting the number of threads (which will translate into number of cores) your application may utilize at any one time. The six minute startup time on eight cores could serve as a benchmark. You may find that it will execute faster on four cores than eight, not to mention thirty-two.
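That experiment doesn't require code changes: the service can be constrained to a subset of cores at launch, or cores can be taken offline for the duration of initialization, which is what the question already hints at. A sketch for Linux; the service path is a placeholder:

# run the service pinned to the first 8 logical CPUs (util-linux taskset)
taskset -c 0-7 /usr/sbin/my_service

# or take individual cores offline before initialization and bring them
# back afterwards (repeat per core; requires root)
echo 0 > /sys/devices/system/cpu/cpu8/online
echo 1 > /sys/devices/system/cpu/cpu8/online

Comparing the initialization time at 4, 8, 16 and 32 available cores gives the scaling curve to benchmark against.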
If this is the case and you still insist on running faster on thirty-two cores than you are today you're looking at an enormous amount of work both making the SNMP-server code more efficient and organizing your data to lessen the load on shared resources. Not for the faint-of-heart. You may end up re-writing the entire application.

Running test cluster cassandra/dse on minimal hardware?

I recently came across some of these rugged computers that one of my predecessors had left in our office. Unfortunately they appear to be maxed out at 4GB RAM, and single core 2.67 GHz processors.
There are four of them, and my first thought was to create a test environment that would mimic our production environment, albeit on a MUCH smaller scale due to hardware (obviously).
I guess my question is: will this work, or will it be a waste of time? The DataStax Enterprise documentation recommends a minimum of 16 GB of RAM (with an 8 GB heap); however, that recommendation is for production. Has anyone run a small test cluster on minimal hardware before?
Yes - you can run a test cluster on such hardware. Check this out: Multi-Datacenter Cassandra on 32 Raspberry Pi’s
Of course you will need to give it a much smaller heap size and new generation size too, and of course if you put it under a heavy load, it will most likely OOM. But you can still run it under a light load.
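To give a rough idea of what "much smaller" can look like on 4 GB nodes, the heap and new-generation sizes are overridden in conf/cassandra-env.sh; the values below are only illustrative starting points for a lightly loaded test cluster:

# conf/cassandra-env.sh -- override the auto-calculated sizes
MAX_HEAP_SIZE="1G"
HEAP_NEWSIZE="200M"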

JVM memory tuning for eXist

Suppose you had a server with 24G RAM at your disposal, how much memory would you allocate to (Tomcat to run) eXist?
I'm setting up our new webserver, with an Intel Xeon E5649 (2.53GHz) processor, running Ubuntu 12.04 64-bit. eXist is running as a webapp inside Tomcat, and the db is only used for querying 'stable' collections --that is, no updates are being executed to the resources inside eXist.
I've been experimenting with different heap sizes (via the -Xms and -Xmx settings when starting the Tomcat process), and so far haven't noticed much difference in response time for queries against eXist. In other words, it doesn't seem to matter much whether the JVM is allocated 4G or 16G. I have also upped #cacheSize and #collectionCache in eXist's WEB-INF/conf.xml to e.g. 8192M, but this doesn't seem to have much effect. I suppose these settings /do/ have an influence when eXist is running inside Tomcat?
I know each situation is different (and I know there's a Tomcat server involved), but are there some rules of thumb for eXist performance w.r.t. the memory it is allocated? I'd like to arrive at a sensible memory configuration for a setup with a larger amount of RAM available.
This question was asked and answered on the exist-open mailing list. The answer from wolfgang#exist-db.org was:
Giving more memory to eXist will not necessarily improve response times. "Bad" queries may consume lots of RAM, but the better your queries are optimized, the less RAM they need: most of the heavy processing will be done using index lookups, and the optimizer will try to reduce the size of the node sets to be passed around. Caching memory thus has to be large enough to hold the most relevant index pages. If this is already the case, increasing the caching space will not improve performance any more. On the other hand, a too-small cacheSize or collectionCache will result in a recognizable bottleneck. For example, a batch upload of resources or creating a backup can take several hours (instead of e.g. minutes) if #collectionCache is too small.
If most of your queries are optimized to use indexes, 8 GB RAM for eXist usually gives you enough room to handle the occasional high load. Ideally you could run some load tests to see what the maximum memory use actually is. For #cacheSize, I rarely have to go beyond 512m. The setting for #collectionCache depends on the number of collections and documents in the database. If you have tens or hundreds of thousands of collections, you may have to increase it up to 768m or more. As I said above, you will recognize a sudden breakdown in performance during uploads or backups if the collectionCache becomes too small.
So to summarize, a reasonable setting for me would be: -Xmx8192m, #cacheSize="512m", #collectionCache="768m". If you can afford to give it 16 GB of main memory, it certainly won't hurt. Also, if you are using the Lucene index or the new range index, you should consider increasing the #buffer setting in the corresponding index module configurations in conf.xml as well:
<module id="lucene-index" buffer="256" class="org.exist.indexing.lucene.LuceneIndex" />
<module id="range-index" buffer="256" class="org.exist.indexing.range.RangeIndex"/>
