WebSphere Liberty auto scaling CPU metric

WAS Liberty has an auto-scaling feature, and one of its metrics is CPU.
My question is about the case where I have many JVMs, belonging to different clusters, running on the same host. Does Liberty measure the CPU consumption of each JVM, or is it per host? If it is host-wide, then scaling decisions are almost meaningless, because how would it know which JVM is actually consuming all the CPU time and needs to be scaled out? If I have 5 JVMs and 4 of them are not doing any work while one JVM is taking 99% of the CPU on the host, I only need to scale that one JVM, not all five.

The manual says metrics trigger on averages across a cluster, so for the scenario you outline:
If they are all in a single cluster, there is really no scaling action required, and it likely wouldn't help.
If the spinning JVM is in its own cluster or a very small one (without the other 4), then it could trip the CPU metric, depending on the average %CPU and the configuration.

The usage of each resource is compared against its threshold at each level (JVM, host, and cluster). When usage goes beyond the threshold at any one of these levels, the scale-out action is executed.
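To make that decision rule concrete, here is a minimal Java sketch of the multi-level check described above. It is purely illustrative (the names, numbers, and thresholds are made up, and this is not Liberty's actual implementation); the point is that a breach at any single level is enough to trigger scale-out.

    // Illustrative only: models the documented behaviour (scale out when any
    // level's CPU usage crosses its threshold), not Liberty's internal code.
    public class ScalingDecision {

        static boolean shouldScaleOut(double jvmCpu, double hostCpu, double clusterAvgCpu,
                                      double jvmMax, double hostMax, double clusterMax) {
            // A breach at any single level is enough to trigger the scale-out action.
            return jvmCpu > jvmMax || hostCpu > hostMax || clusterAvgCpu > clusterMax;
        }

        public static void main(String[] args) {
            // Scenario from the question: one busy JVM, four idle ones on the same host.
            double busyJvmCpu = 0.99;   // the spinning JVM
            double hostCpu    = 0.99;   // host-wide usage is dominated by that JVM
            double clusterAvg = 0.25;   // average across the cluster containing it

            boolean scale = shouldScaleOut(busyJvmCpu, hostCpu, clusterAvg,
                                           0.90, 0.90, 0.90);
            System.out.println("Scale out? " + scale); // true: JVM and host thresholds are breached
        }
    }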

Related

Kubernetes number of replicas vs performance

I have just gotten into Kubernetes and really liking its ability to orchestrate containers. I had the assumption that when the app starts to grow, I can simply increase the replicas to handle the demand. However, now that I have run some benchmarking, the results confuse me.
I am running Laravel 6.2 w/ Apache on GKE with a single g1-small machine as the node. I'm only using NodePort service to expose the app since LoadBalancer seems expensive.
The benchmarking tools used are wrk and ab. When the replicas are increased to 2, requests/s somehow drops. I would expect requests/s to increase, since there are 2 pods available to serve requests. Is there a bottleneck occurring somewhere, or is my understanding flawed? I hope someone can point out what I'm missing.
A g1-small instance is really tiny: you get 50% utilization of a single core and 1.7 GB of RAM. You don't describe what your application does or how you've profiled it, but if it's CPU-bound, then adding more replicas of the process won't help you at all; you're still limited by the amount of CPU that GCP gives you. If you're hitting the memory limit of the instance that will dramatically reduce your performance, whether you swap or one of the replicas gets OOM-killed.
The other thing that can affect this benchmark is bursting: shared-core instances are sometimes allowed to burst up to 100% CPU utilization for a limited time. So if you got an instance and ran the first benchmark, it might have used a burst period and seen higher performance, while re-running the second benchmark on the same instance might not get to do that.
In short, you can't just crank up the replica count on a Deployment and expect better performance. You need to identify where in the system the actual bottleneck is. Monitoring tools like Prometheus that can report high-level statistics on per-pod CPU utilization can help. In a typical database-backed Web application the database itself is the bottleneck, and there's nothing you can do about that at the Kubernetes level.
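If you want to confirm whether the handler is actually CPU-bound before buying bigger nodes or adding replicas, one cheap trick is to compare CPU time with wall-clock time for a request. The sketch below uses Java's ThreadMXBean purely as an illustration (the workload and numbers are made up; the same idea applies to a PHP app via its own profiling tools): a ratio near 1.0 means the work is CPU-bound and more replicas on the same half-core node won't help, while a low ratio means the request is mostly waiting, e.g. on the database.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    // Illustrative sketch: measure how much of a request's wall-clock time is spent
    // on the CPU. Assumes thread CPU time measurement is enabled (the default on
    // most JVMs). A ratio close to 1.0 means the work is CPU-bound.
    public class BottleneckCheck {

        public static void main(String[] args) throws Exception {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();

            long cpuStart  = threads.getCurrentThreadCpuTime(); // nanoseconds of CPU time
            long wallStart = System.nanoTime();

            handleRequest(); // stand-in for your real request handler

            double cpuMillis  = (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000.0;
            double wallMillis = (System.nanoTime() - wallStart) / 1_000_000.0;

            System.out.printf("cpu=%.1f ms, wall=%.1f ms, ratio=%.2f%n",
                    cpuMillis, wallMillis, cpuMillis / wallMillis);
        }

        // Hypothetical workload: part CPU work, part waiting (e.g. a database call).
        static void handleRequest() throws InterruptedException {
            long x = 0;
            for (int i = 0; i < 5_000_000; i++) x += i * 31L;  // CPU-bound portion
            Thread.sleep(50);                                   // I/O-wait portion
            if (x == 42) System.out.println();                  // keep the loop from being optimized away
        }
    }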

Apache Nifi slow cluster issue

I am using Apache NiFi for one of my clickstream projects to do some ETL.
I am currently getting traffic of around 300 messages per second with the following infrastructure:
RAM - 16 GB
Swap - 6 GB
CPU - 16 cores
Disk - 100 GB (persistence not required)
Cluster - 6 nodes
The entire cluster UI has become extremely slow, with the following issues:
Processors hit back pressure when some failure happens, which consumes a lot of threads
Provenance writing becomes very slow
Cluster heartbeats across nodes become slow
I have the following questions on the setup:
Is RPG (Remote Process Group) use recommended? It is an HTTP call, which I am using to spread load across all the nodes, since there is an existing issue with the EMQTT processor for consumer groups.
What is the recommended value of thread count that should be allotted per core?
What are the guidelines for infrastructure sizing?
What are the tuning parameters for a large cluster with high incoming request rates and a lot of heavy JSON parsing for transformation?
A couple of suggestions:
Yes, RPG usage is recommended; at least from what I've experienced, RPG seems to offer better distribution. Take a look at [3] below.
Some processors are more CPU intensive than others, so there's no clear-cut answer for what value to set for Concurrent Tasks. This is more of a trial-and-error, test-and-tune approach that you'd have to master. One caution: if you set too many Concurrent Tasks for a CPU-intensive processor, it will have a serious impact on the nodes (see the sketch after the links below).
Hortonworks has published a detailed guide on this; see [1] below.
Some best practices and handy guides:
[1] https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
[2] http://ijokarumawak.github.io/nifi/2016/11/22/nifi-jolt/
[3] https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/
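On the Concurrent Tasks point, the trial-and-error can be made a bit more systematic: measure throughput of a CPU-heavy transform at increasing thread counts and stop where the gain flattens or reverses. The following stand-alone Java sketch is only a synthetic stand-in for a CPU-intensive processor (it is not NiFi code, and the workload is made up):

    import java.util.concurrent.*;

    // Rough harness for picking a Concurrent Tasks value: run a synthetic CPU-bound
    // "transform" at increasing thread counts and watch where throughput stops improving.
    public class ConcurrencyProbe {

        static void transform() {                       // stand-in for heavy JSON parsing
            long x = 0;
            for (int i = 0; i < 2_000_000; i++) x ^= i * 2654435761L;
            if (x == 1) System.out.println();           // defeat dead-code elimination
        }

        static double throughput(int threads, int tasks) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long start = System.nanoTime();
            for (int i = 0; i < tasks; i++) pool.submit(ConcurrencyProbe::transform);
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
            return tasks / ((System.nanoTime() - start) / 1e9);   // tasks per second
        }

        public static void main(String[] args) throws Exception {
            int cores = Runtime.getRuntime().availableProcessors();
            for (int t : new int[]{1, cores / 2, cores, cores * 2, cores * 4}) {
                if (t < 1) continue;
                System.out.printf("threads=%d -> %.1f tasks/s%n", t, throughput(t, 200));
            }
            // Past the core count, throughput typically flattens; oversubscribing a
            // CPU-bound stage mostly adds context switching and starves other processors.
        }
    }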

Detecting CPU load on different machines

I am trying to create a background task scheduler for my process, which needs to schedule compute-intensive tasks in parallel while maintaining the responsiveness of the UI.
Currently, I am comparing CPU usage (percentage) against a threshold (~50%) to decide when the scheduler should start a new task, and it sort of works fine.
This program can run on a variety of hardware configurations (e.g. processor speed, number of cores), so a 50% limit can be too harsh or too soft for certain configurations.
Is there a good way to take the different parameters of the CPU configuration (e.g. cores, speed) into account and dynamically come up with a threshold based on the hardware?
My suggestions:
Run as many threads as CPUs in the system.
Set the priority of each thread to idle (lowest).
In each thread's main loop, do the smallest sleep possible, i.e. usleep(1).
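A minimal Java sketch of those suggestions, assuming the background work can be broken into small chunks (the worker names and the work itself are placeholders):

    // Sketch only: background workers sized to the CPU count, running at the lowest
    // priority and sleeping briefly each iteration so foreground/UI threads win contention.
    public class BackgroundScheduler {

        public static void main(String[] args) {
            int workers = Runtime.getRuntime().availableProcessors();

            for (int i = 0; i < workers; i++) {
                Thread t = new Thread(BackgroundScheduler::workLoop, "bg-worker-" + i);
                t.setPriority(Thread.MIN_PRIORITY);   // "idle" priority: scheduled last under contention
                t.setDaemon(true);                    // don't keep the process alive
                t.start();
            }

            // ... UI / main thread continues here ...
        }

        static void workLoop() {
            while (true) {
                doSmallChunkOfWork();                 // hypothetical: one slice of a compute task
                try {
                    Thread.sleep(0, 1_000);           // ~usleep(1): give the scheduler a chance to preempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        static void doSmallChunkOfWork() {
            Math.sqrt(System.nanoTime());             // placeholder for the real compute-intensive slice
        }
    }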

Impact of more executors than cpu/cores in Storm Cluster

I have started using Apache Storm recently. Right now I am focusing on performance testing and tuning for one of my applications (it pulls data from a NoSQL database, formats it, and publishes it to a JMS queue for consumption by the requester) to enable more parallel request processing at a time. I have been able to tune the topology in terms of changing the number of bolts, max spout pending, etc., and to throttle data flow within the topology using a tick-based approach.
I wanted to know what happens when we define more parallelism than the number of cores we have. In my case I have a single-node, single-worker topology and the machine has 32 cores, but the total number of executors (for all the spouts and bolts) is 60. So my questions are:
Does this high number really help process requests, or does it actually degrade performance, since I believe there will be more context switching between bolt tasks to utilize the cores?
If I define 20 (just a random selection) executors for a bolt and my code flow never needs to use that bolt, will this impact performance? How does Storm handle this situation?
This is a very general question, so the answer is (as always): it depends.
If your load is large and a single executor fully utilizes a core, having more executors cannot give you any throughput improvement. If there is any impact, it might be negative (also with regard to contention on the internally used queues that all executors read from and write to for tuple transfer).
If your load is "small" and does not fully utilize your CPUs, it wouldn't matter either -- you would not gain or lose anything -- since your cores are not fully utilized, you have some headroom left over anyway.
Furthermore, consider that Storm spawns some more threads within each worker. Thus, if your executors fully utilize your hardware, those threads will also be impacted.
Overall, you should not run your topologies to utilize the cores completely anyway, but leave some headroom for small "spikes" etc. In operation, maybe 80% CPU utilization is a good value. As a rule of thumb, one executor per core should be OK.
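As a hedged sketch of that rule of thumb, here is how the parallelism hints could be derived from the core count instead of being hard-coded to 60. It assumes the Storm 2.x API, and the spout/bolt classes are trivial placeholders for your real NoSQL-reading spout and JMS-publishing bolt:

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Sketch: size executors from the available cores with ~20% headroom.
    public class SizedTopology {

        public static class NoopSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
                this.collector = collector;
            }
            public void nextTuple() { collector.emit(new Values(System.nanoTime())); }
            public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("value")); }
        }

        public static class NoopBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) { /* format + publish here */ }
            public void declareOutputFields(OutputFieldsDeclarer d) { }
        }

        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();       // 32 in the question
            int boltExecutors  = Math.max(1, (int) (cores * 0.8));        // ~80% of cores for the heavy stage
            int spoutExecutors = Math.max(1, cores / 8);                  // small share for the spout

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("source", new NoopSpout(), spoutExecutors);
            builder.setBolt("format-and-publish", new NoopBolt(), boltExecutors)
                   .shuffleGrouping("source");

            Config conf = new Config();
            conf.setNumWorkers(1);                                        // single-worker topology as in the question
            conf.setMaxSpoutPending(1000);                                // the max-spout-pending knob mentioned above

            System.out.printf("spout=%d, bolt=%d executors on %d cores%n",
                    spoutExecutors, boltExecutors, cores);
            // StormSubmitter.submitTopology("sized-topology", conf, builder.createTopology());
        }
    }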

How to measure resource usage at partition level in Service Fabric?

With Service Fabric we get the tools to create custom metrics and capacities. This way we can each build our own resource model that the resource balancer acts on at runtime. I would like to monitor and use physical resources such as memory, CPU, and disk usage. This works fine as long as we keep using the default load.
But load is not static for a service/actor, so I would like to use the built-in dynamic load reporting. This is where I run into a problem: ReportLoad works at the level of partitions, but partitions all live within the same process on a node. All the methods for monitoring physical resources that I have found use the process as the smallest unit of measurement, such as PerformanceCounter. If that value were used, there could be hundreds of partitions reporting the same load, a load that is not representative of any one partition.
So the question is: how can resource usage be measured at partition level?
Not only are service instances and replicas hosted in the same process, but they also share a thread pool by default in .NET! Every time you create a new service instance, the platform actually just creates an instance of your service class (the one that derives from StatefulService or StatelessService) inside the host process. This is great because it's fast, cheap, and you can pack a ton of services into a single host process and on to each VM or machine in your cluster.
But it also means resources are shared, so how do you know how much each replica of each partition is using?
The answer is that you report load on virtual resources rather than physical resources. The idea is that you, the service author, can keep track of some measurement about your service, and you formulate metrics from that information. Here is a simple example of a virtual resource that's based on physical resources:
Suppose you have a web service. You run a load test on your web service and you determine the maximum requests per second it can handle on various hardware profiles (using Azure VM sizes and completely made-up numbers as an example):
A2: 500 RPS
D2: 1000 RPS
D4: 1500 RPS
Now when you create your cluster, you set your capacities accordingly based on the hardware profiles you're using. So if you have a cluster of D2s, each node would define a capacity of 1000 RPS.
Then each instance (or replica if stateful) of your web service reports an average RPS value. This is a virtual resource that you can easily calculate per instance/replica. It corresponds to some hardware profile, even though you're not reporting CPU, network, memory, etc. directly. You can apply this to anything that you can measure about your services, e.g., queue length, concurrent user count, etc.
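As a sketch of the per-replica bookkeeping behind such a virtual metric (plain Java, deliberately independent of any Service Fabric API; the class and method names are made up), each replica counts only the requests it served itself and periodically turns that into an average RPS value, which is what you would hand to the partition's ReportLoad call mentioned in the question:

    import java.util.concurrent.atomic.LongAdder;

    // Per-replica bookkeeping for a "virtual" RPS metric. Each replica counts the
    // requests it handled itself, so the number is meaningful at partition level
    // even though all replicas share one host process.
    public class RpsTracker {
        private final LongAdder requests = new LongAdder();
        private volatile long windowStartMillis = System.currentTimeMillis();

        /** Call once per request this replica serves. */
        public void recordRequest() {
            requests.increment();
        }

        /** Average RPS since the last call; this is the value to report as load. */
        public int drainAverageRps() {
            long now = System.currentTimeMillis();
            double seconds = Math.max(1, now - windowStartMillis) / 1000.0;
            long count = requests.sumThenReset();
            windowStartMillis = now;
            return (int) Math.round(count / seconds);
        }
    }

    // Usage (hypothetical): on a timer, e.g. every 30 seconds, each replica computes
    //   int rps = tracker.drainAverageRps();
    // and passes "rps" as the value of its "RPS" load metric via the partition's
    // load-reporting API (ReportLoad), against a node capacity such as 1000 on a D2.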
If you don't want to define a capacity as specific as requests per second, you can take a more general approach by defining physical-ish capacities for common resources, like memory or disk usage. But what you're really doing here is defining usable memory and disk for your services rather than total available. In your services you can keep track of how much of each capacity each instance/replica uses. But it's not a total value, it's just the stuff you know about. So for example if you're keeping track of data stored in memory, it wouldn't necessarily include runtime overhead, temporary heap allocations, etc.
I have an example of this approach in a Reliable Collection wrapper I wrote that reports load metrics strictly on the amount of data you store by counting bytes: https://github.com/vturecek/metric-reliable-collections. It doesn't report total memory usage, so you have to come up with a reasonable estimate of how much overhead you need and define your capacities accordingly, but at the same time by not reporting temporary heap allocations and other transient memory usage, the metrics that are reported should be much smoother and more representative of the actual data you're storing (you don't necessarily want to re-balance the cluster simply because the .NET GC hasn't run yet, for example).
