ejabberd cluster load test: performance worse than expected

I load-tested a single ejabberd node (4 CPU, 8 GB RAM, CentOS 7) with Tsung: 6000 simultaneous users log in, then join a MUC room and send a message every ten seconds. The node shows a mean load of about 2.5 and mean CPU usage around 40%. But with a two-node ejabberd cluster (each node 4 CPU, 8 GB RAM, Linux) and 12000 users, each node shows a mean load of about 3.5 and mean CPU usage around 75%. With three nodes, CPU usage in the 12000-user test is even higher. Can anyone explain this? Thanks.
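To make the reported numbers easier to compare, here is a small sketch of the per-node arithmetic implied by the question, assuming the Tsung users are spread evenly across the cluster nodes:

    # Per-node CPU arithmetic implied by the reported figures, assuming the
    # Tsung users are spread evenly across the cluster nodes.

    setups = {
        # name: (total_users, node_count, mean_cpu_percent_per_node)
        "1 node": (6000, 1, 40),
        "2 nodes": (12000, 2, 75),
    }

    for name, (users, nodes, cpu) in setups.items():
        users_per_node = users / nodes
        cpu_per_1k_users = cpu / (users_per_node / 1000)
        print(f"{name}: {users_per_node:.0f} users/node, "
              f"~{cpu_per_1k_users:.1f}% CPU per 1000 users on each node")

    # 1 node: 6000 users/node, ~6.7% CPU per 1000 users on each node
    # 2 nodes: 6000 users/node, ~12.5% CPU per 1000 users on each node

In other words, with clustering enabled each node spends roughly twice the CPU for the same 6000 users per node; that overhead is typically attributed to inter-node traffic (MUC message routing and session/table replication between nodes), though confirming that would require profiling the actual deployment.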

Related

kubernetes and/or terraform configuration issue causing performance degradation?

I'm migrating an e2e test stack from a Docker Compose based setup to Kubernetes. As part of this migration, I'm also creating Terraform modules for the individual services that make up a product.
A single e2e stack is currently composed of ~50 pods and starts up in about 5 minutes (I run dedicated DBs, in-memory data stores, ESB integration tools, external mocked services, etc., per stack, hence the high number of pods).
During testing I would like to start up as many of those complete stacks as possible.
Currently I have a k8s cluster with 9 nodes:
six nodes with 64GB RAM, 512GB SSD, and latest-gen i5 CPUs (max-pods at the default 110), and
three nodes with 256GB RAM, 1TB SSD, and 18-core Xeon CPUs (max-pods set to 330 on each).
I'm using my terraform modules to start up the stacks (the modules mostly define kubernetes resources).
My expectation would be that I can fire up ~30 stacks in parallel without major hiccups. I would also expect startup times to follow, to a reasonable extent, a function like st = ((#s + #n) % #n) * ss-st (a rough sketch of this expectation follows the symbol list), where:
st denotes the overall startup time,
#s denotes number of stacks,
#n denotes number of nodes and finally
ss-st denotes single stack startup time.
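The sketch, treating the % term as "stacks per node, rounded up" (an assumption about the formula's intent, not a literal transcription) and adding a raw max-pods budget check for the cluster described above:

    import math

    # Raw pod-capacity check and idealized startup-time model for the cluster
    # described above. The (#s + #n) % #n term is read here as ceiling division
    # over the nodes, i.e. "stacks per node, rounded up".

    MAX_PODS = 6 * 110 + 3 * 330   # kubelet max-pods across the 9 nodes -> 1650
    PODS_PER_STACK = 50            # ~50 pods per e2e stack
    SINGLE_STACK_STARTUP = 5 * 60  # ss-st: ~5 minutes, in seconds

    def idealized_startup(stacks: int, nodes: int = 9) -> int:
        """st = ceil(#s / #n) * ss-st (idealized; ignores scheduler and disk contention)."""
        return math.ceil(stacks / nodes) * SINGLE_STACK_STARTUP

    for stacks in (9, 18, 30):
        pods = stacks * PODS_PER_STACK
        print(f"{stacks} stacks: {pods} pods "
              f"({'fits' if pods <= MAX_PODS else 'exceeds'} max-pods budget of {MAX_PODS}), "
              f"idealized startup ~{idealized_startup(stacks) // 60} min")

    # 9 stacks: 450 pods (fits max-pods budget of 1650), idealized startup ~5 min
    # 18 stacks: 900 pods (fits max-pods budget of 1650), idealized startup ~10 min
    # 30 stacks: 1500 pods (fits max-pods budget of 1650), idealized startup ~20 min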
However, reality is very different. The blue columns show actual startup times (measured in seconds), while the red ones show my idealized expectation.
I installed the Prometheus operator and I have some metrics, but none of them explain to me (yet) where the bottleneck is in this case (disk utilization seems to get maxed out on the master nodes from time to time, but that alone does not seem to explain the end figures).
What am I doing wrong?

JMeter test setup (Master & Slave) on Azure VMs to trigger 3000 concurrent users

Can anyone share some inputs/links on running JMeter (Master & Slave) on Azure VMs? What VM configuration (Master & Slave) should be used to trigger 3000 concurrent users?
We don't know, as it depends on the nature of your test plan: number of requests per second, size of requests/responses, number of pre/post processors, assertions, etc.
You need to measure it, i.e.
Implement your test plan
Run the test starting from 1 user and gradually increasing the load up to 3000, while watching resource consumption via Azure Monitor or the JMeter PerfMon Plugin
When any of the monitored metrics like CPU, RAM or network usage starts exceeding a reasonable threshold, e.g. 80% of the total available capacity, take a look at how many virtual users were online at that point using e.g. the Active Threads Over Time plugin
If the number of users is 3000 - you're good to go with this VM
If the number of users is less than 3000 - consider either increasing the size of the VM or adding a new VM of the current size as a slave (or several machines, depending on how many users you were able to kick off); see the sizing sketch below
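A minimal sketch of that last step, assuming the calibration run told you how many virtual users one VM of the current size can sustain (the 800-user figure below is purely illustrative, not a measurement):

    import math

    # Scale-out estimate based on a single-VM calibration run.
    TARGET_USERS = 3000

    def slaves_needed(users_per_vm: int, target: int = TARGET_USERS) -> int:
        """Number of JMeter slave VMs of the calibrated size needed for the target load."""
        return math.ceil(target / users_per_vm)

    # e.g. if one VM comfortably drove 800 virtual users before hitting ~80% resource usage:
    print(slaves_needed(800))  # -> 4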

Elasticsearch RAM recommendation

I'm deploying an Elasticsearch cluster that will ingest roughly 40GB a day, with a time-to-live of 365 days. Write speed would be around 50 msgs/sec. Reads would be mostly driven by user dashboards, so the read frequency won't be high. What would be the best hardware requirements for this amount of data? How many master and data nodes would be required in this situation?
Obviously you should choose the hardware based on the search and indexing rate. 50 msgs/sec is very low for Elasticsearch. You have 14.6TB of data in total, which should be at most 85 percent of your total disk (based on the 85% watermark); this means you need about 17TB of disk. I think you can use one server with 128GB RAM, at least a 10-core CPU and 17TB of disk, or two servers with half of this configuration each: one server as a master and data node, and the other as a data-only node.
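For clarity, the disk arithmetic from this answer spelled out in a short sketch (note that replica shards, if any are used, would add on top of these figures):

    # Disk sizing from the answer above. The 0.85 factor corresponds to
    # Elasticsearch's default ~85% low disk watermark.

    daily_ingest_gb = 40
    retention_days = 365
    watermark = 0.85

    raw_data_tb = daily_ingest_gb * retention_days / 1000   # ~14.6 TB of indexed data
    required_disk_tb = raw_data_tb / watermark               # ~17.2 TB total disk

    print(f"raw data: {raw_data_tb:.1f} TB, disk needed: {required_disk_tb:.1f} TB")
    # raw data: 14.6 TB, disk needed: 17.2 TB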

What is a suitable number of queue workers?

I was wondering if there is a relation between the number of queue workers and CPU or RAM resources. I noticed that Laravel defaults to 8-10 workers; however, in my own experience, I once increased them to 50 workers and got a huge performance boost on my DigitalOcean VPS (8GB RAM and 4 CPUs) compared to only 10 workers.
So is there any relation between their number and the available resources?
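A rough, assumption-heavy way to picture that relation with the numbers from the question (the per-worker memory figure and I/O-wait share below are placeholders, not Laravel defaults or measurements):

    # Queue workers are separate PHP processes, so RAM caps how many you can run,
    # while CPU only becomes the limit once jobs stop being I/O-bound.

    total_ram_mb = 8 * 1024   # the 8 GB DigitalOcean VPS from the question
    reserved_mb = 2 * 1024    # headroom for the app, DB, OS caches (assumption)
    per_worker_mb = 80        # memory of one queue worker process (assumption)

    ram_ceiling = (total_ram_mb - reserved_mb) // per_worker_mb
    print(f"RAM alone allows roughly {ram_ceiling} workers")   # ~76

    # With 4 CPUs, 50 workers only help because jobs spend most of their time
    # waiting on I/O (HTTP calls, DB, mail); for purely CPU-bound jobs the useful
    # count collapses back toward the number of cores.
    cpu_cores = 4
    io_wait_share = 0.9       # assumed fraction of job time spent waiting on I/O
    useful_workers = round(cpu_cores / (1 - io_wait_share))
    print(f"CPU-wise, ~{useful_workers} workers keep {cpu_cores} cores busy at "
          f"{io_wait_share:.0%} I/O wait")                     # ~40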

Understanding RESTful Web Service stress test results

I'm trying to stress-test my Spring RESTful Web Service.
I run my Tomcat server on an Intel Core 2 Duo notebook with 4 GB of RAM. I know it's not a real server machine, but it's all I have and it's only for study purposes.
For the test, I run JMeter on a remote machine, and communication goes through a private WLAN with a central wireless router. I prefer to test over a wireless connection because the service would be accessed from mobile clients. With JMeter I run a group of 50 threads, starting one thread per second, so after 50 seconds all threads are running. Each thread repeatedly sends an HTTP request to the server containing a small JSON object to be processed, sleeping on each iteration for an amount of time equal to the sum of a 100 millisecond constant delay and a random value drawn from a Gaussian distribution with a standard deviation of 100 milliseconds. I use some JMeter plugins for graphs.
Here are the results:
I can't figure out why my hits per second don't pass the 100 threshold (in the graph they are multiplied by 10), because with this configuration the value should be higher (50 threads sending at least three times per second would generate 150 hits/sec). I don't get any error messages from the server, and all seems to work well. I've tried more and more configurations, but I can't get more than 100 hits/sec.
Why?
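To make that expectation concrete, here is a minimal closed-workload sketch of the arithmetic; the think-time value is an approximation of the timer described above (the Gaussian component is ignored for simplicity), and the response times are illustrative:

    # Closed-workload arithmetic: each of the 50 threads issues one request,
    # waits for the response, then sleeps, so
    #   hits/sec ~= threads / (think_time + response_time).

    THREADS = 50
    THINK_TIME_S = 0.10   # ~100 ms constant delay (Gaussian part ignored here)

    def hits_per_sec(avg_response_s: float) -> float:
        return THREADS / (THINK_TIME_S + avg_response_s)

    for rt in (0.05, 0.23, 0.40):
        print(f"avg response {rt*1000:.0f} ms -> ~{hits_per_sec(rt):.0f} hits/sec")

    # avg response 50 ms -> ~333 hits/sec
    # avg response 230 ms -> ~152 hits/sec   (roughly the "150 hits/sec" expectation)
    # avg response 400 ms -> ~100 hits/sec   (roughly what the graph shows)

Under this model the cap comes from the server's response time rather than from the number of threads, which is consistent with the CPU-bound explanation in the answer below.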
[EDIT] Many times I notice a substantial performance degradation from some point on, without any visible cause: no error response messages on the client, only OK HTTP responses, and all seems to work well on the server too, but looking at the reports:
As you can see, something happens between 01:54 and 02:14: hits per second decrease and response time increases. OK, it could be a server overload, but what about the CPU decreasing? That is not compatible with the congestion hypothesis.
I want to note that you've chosen very well which rows to display on the Composite Graph. It's enough to draw some conclusions:
Note that Hits Per Second perfectly correlates with CPU usage. This means you have a "CPU-bound" system, and the maximum performance is mostly limited by CPU. This is very important to remember: server resources are spent on hits, not on active users. You could disable your sleep timers entirely and would still receive the same 80-90 hits/s.
The maximum CPU level is somewhere around 80%, so I assume you run Windows (Win7?) on your machine. In my experience it is often impossible to reach 100% CPU utilization on a Windows machine; it just does not let you spend the last 20%. And if you have reached that maximum, then you are seeing your installation's capacity limit: it simply does not have enough CPU resources to serve more requests. To fight this bottleneck you should either add CPU (use another server with better CPU hardware), configure the OS to let you use up to 100% (I don't know if that is applicable), or optimize your system (code, OS settings) to spend less CPU per request.
For the second graph, I'd suppose something is being downloaded through the router, or something is happening on the JMeter machine. "Something happens" means some task is running: it may be your friend who just wanted to run a quick "grep error.log", or a scheduled task. To pin this down you should look at the router's and the JMeter machine's resources during the degradation. There must be a process that swallows CPU/disk/network.
