I've got a Laravel application running with about 800-1,000 concurrent users (10,000 users overall). The main time frame of use is from 8 am to 6 pm. However, the server's memory usage increases constantly (see attachment) and peaks at about 10 pm (roughly 100 concurrent users).
I assume query logging is causing the constant memory increase. I tried to stop it by adding DB::disableQueryLog(); to the boot() method of AppServiceProvider.php. However, this doesn't seem to stop query logging. How can I fix this problem?
I'm using JMeter to build a performance test. To keep things short and straight: I read the initial data from a JSON file, and I have a single thread group in which, after reading the data, I randomize certain values to prevent data duplication when I need to. I then pass the final data to the endpoint using variables; this ends up in a JSON body that is received by the endpoint and basically generates a new transaction in the database. I also added a constant timer to add a 7-second delay between requests. With a test duration of 10 minutes and no ramp-up, I calculated the requests per second like this:
One minute has 60 seconds and I have a delay of 7 seconds per request, so it's logical to say that I'm sending approximately 8.5 requests per minute; this is my calculation: (60/7) ≈ 8.5. If the test lasts 10 minutes, I multiply (8.5 * 10) = 85, giving me a total of 85 transactions in 10 minutes, so I should see exactly that number of transactions created in the database after the test completes.
This holds when I'm running 10, 20 or 40 users: after the load test run I query the DB and get exactly the expected number of transactions. However, as I increase the users in the thread group this no longer happens. For example, with 1000 users I should be able to generate 8500 transactions in 10 minutes, but that is not the case: the DB only shows around 5.1k transactions.
What is happening, and what is wrong? Why does it work as expected initially but stop doing so as I increase the users? I can provide more information if needed. Please help.
There are two possible reasons for this:
You have discovered your application's bottleneck. When you add more users, response time increases and therefore throughput decreases. There is a term called the saturation point, which stands for the maximum performance of the system; beyond this point the system responds more slowly and you get less TPS than before (a small illustrative model follows after the checklists below). On the application-under-test side, look into the following areas:
It might be that your application simply lacks resources (CPU, RAM, network, etc.); make sure it has enough headroom to operate, e.g. using the JMeter PerfMon Plugin.
Your application middleware (application server, database, load balancer, etc.) may not be properly set up for high loads. Identify your infrastructure stack and follow the performance tuning guidelines for each component.
It is also possible that your application code needs optimization; you can detect the most time/resource-consuming functions, largest objects, slowest DB queries, idle times, etc. using profiling tools.
JMeter is not sending requests fast enough
Just like for the application under test, check that the JMeter machine(s) have enough resources (CPU, RAM, etc.)
Make sure to follow JMeter Best Practices
Consider going for Distributed Testing
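To make the saturation point concrete, here is a small illustrative model (a sketch with made-up capacity and timing numbers, not measurements of your system). It assumes the server can complete a fixed number of requests per second and shows how throughput flattens and response time stretches once the offered load exceeds that capacity:

```java
// Toy closed-system model of a saturation point. All numbers are
// illustrative assumptions, not measurements of any real system.
public class SaturationPointSketch {
    public static void main(String[] args) {
        double serverCapacityTps = 120.0; // assumed max requests/s the server can complete
        double baseResponseSec   = 0.2;   // assumed response time while the server keeps up
        double thinkTimeSec      = 7.0;   // pacing between requests (cf. the constant timer)

        System.out.println("users | offered TPS | actual TPS | response time (s)");
        for (int users = 100; users <= 2000; users += 100) {
            // Load the users would generate if the server kept up.
            double offeredTps = users / (thinkTimeSec + baseResponseSec);
            // The server cannot exceed its capacity.
            double actualTps = Math.min(offeredTps, serverCapacityTps);
            // Beyond saturation, response time stretches so that
            // users = actualTps * (responseTime + thinkTime) still holds.
            double responseSec = (offeredTps <= serverCapacityTps)
                    ? baseResponseSec
                    : (double) users / serverCapacityTps - thinkTimeSec;
            System.out.printf("%5d | %11.1f | %10.1f | %17.2f%n",
                    users, offeredTps, actualTps, responseSec);
        }
    }
}
```

Past the saturation point, adding users only adds response time, which is why a larger thread group can complete fewer transactions than the per-user arithmetic predicts.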
Can you please check the CPU and memory utilization (RAM and Java heap) of the JMeter load generator while running JMeter with 1000 users? If it is high or close to the maximum, it may affect requests/sec. Also, just to confirm the requests/sec on the JMeter side, can you add a listener to the JMeter script to track hits/sec or TPS?
Your expectation (8.5K requests in a 10-minute test) will also only hold if your API response time is about 1 second and you have provided enough ramp-up time for those 1000 users.
So the possible reasons are:
You did not provide enough ramp-up time for the 1000 users.
Your API's average response time is more than 1 second when you run the test with 1000 users.
Possible workarounds:
First, try to measure the API response time for 1 user.
Then calculate how many users you need to reach 8500 requests in 10 minutes, using this formula (see the sketch after this list):
required threads = target TPS * max response time (in seconds)
Give proper ramp-up time for 1000 users. Check this thread to understand how you should calculate ramp-up time.
Check that your load generator is able to simulate 1000 users without any memory or health (i.e. CPU usage) issues. If required, use a distributed architecture.
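A minimal sketch of that calculation (the response time below is an assumption; replace it with your 1-user measurement). If you keep the 7-second constant timer, it effectively adds to the response-time term, because it delays each thread's next request in the same way:

```java
// Sketch: how many threads are needed to reach a target throughput,
// using: required threads = target TPS * (response time + think time).
public class RequiredThreadsSketch {
    public static void main(String[] args) {
        double targetRequests = 8500.0;      // desired transactions
        double testDurationSec = 600.0;      // 10-minute test
        double targetTps = targetRequests / testDurationSec;  // ~14.2 req/s

        double measuredResponseSec = 1.5;    // assumption; measure with 1 user first
        double constantTimerSec = 7.0;       // the constant timer from the test plan

        double requiredThreads = targetTps * (measuredResponseSec + constantTimerSec);
        System.out.printf("Target TPS: %.1f, required threads: ~%.0f%n",
                targetTps, Math.ceil(requiredThreads));
    }
}
```

With these assumed numbers the formula suggests a bit over 120 always-busy threads for ~14 requests/s; plug in your own measured response time to get the real figure.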
For a REST API application in Java, we are planning to perform a load test, but the initial results are a bit confusing. After developing the script in JMeter:
1. We execute the script for 1, 2, 5, 10 & 25 vusers.
2. Each test is executed for a 30-minute duration with roughly a 5-second ramp-up.
3. Each request has a random think time of 2 to 3 seconds.
When this test is executed, we see that for a few APIs the 95th-percentile response time for 2, 5 and 10 vusers is far lower than for 1 vuser. But the same test after a restart of Tomcat gives different results.
I am confused as to why the response time decreases as vusers increase.
Response time graphs when the Tomcat instance is not restarted: https://imgur.com/a/bAqtHQm
Response time graphs when the Tomcat instance is restarted: https://imgur.com/a/KGhl4wS
There is a Java runtime feature, just-in-time (JIT) compilation: Java bytecode gets compiled into native code after a method has been invoked a certain number of times (~1,500 by default for the client compiler, 10,000 for the server compiler), controllable via the -XX:CompileThreshold property.
That could explain the situation you're facing: the Java runtime optimizes functions according to their usage, so a function's execution time may decrease as you call it repeatedly.
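To see this warm-up effect in isolation, here is a minimal stand-alone sketch (not your API, just an illustration of JIT warm-up) that times the same method on a cold pass and again after many invocations:

```java
// Illustrative only: timings are noisy and depend on JVM flags
// (e.g. -XX:CompileThreshold, tiered compilation), but the warmed-up
// pass is typically much faster than the cold one.
public class JitWarmupDemo {

    // Some arbitrary work for the JIT to optimize.
    static long work(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i % 7;
        }
        return sum;
    }

    // Average nanoseconds per call over the given number of iterations.
    static long timePass(int iterations) {
        long start = System.nanoTime();
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += work(10_000);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("  (checksum " + sink + ")"); // keep the result alive
        return elapsed / iterations;
    }

    public static void main(String[] args) {
        System.out.println("Cold pass,   avg ns/call: " + timePass(100));
        timePass(20_000); // run well past the compile threshold
        System.out.println("Warmed pass, avg ns/call: " + timePass(100));
    }
}
```

With default tiered compilation the JVM compiles in several stages rather than at a single 1,500-call threshold, but the overall pattern (later calls get much faster) is the same.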
Also, don't expect the response time for 2 virtual users to be twice that of 1 virtual user. The application scales up to a certain point: as you increase the load, throughput grows while response time stays roughly the same.
At some point response time starts growing and throughput drops; that point is the performance bottleneck. However, the chance of hitting the application's limits with 25 users is minimal on modern hardware.
So consider applying the following performance testing types:
Load testing: start with 1 user and gradually increase the load up to the anticipated number of virtual users, watching throughput and response time as you go. If you don't detect performance degradation as the number of users grows, you can report that the application is ready for production use.
Stress testing: start with 1 user and gradually increase the load until response time starts growing or errors start occurring. This tells you the maximum number of users the application can support and which component will fail first.
More information: Why ‘Normal’ Load Testing Isn’t Enough
Check that your API does not return 200 for invalid responses at scale.
Use ResponseAssertion for that.
I'm trying to create an app that can efficiently write data into Azure Table storage. In order to test storage performance, I created a simple console app which sends hardcoded entities in a loop. Each entity is 0.1 kB. Data is sent in batches (100 items per batch, 10 kB per batch). For every batch I prepare entries with the same partition key, which is generated by incrementing a global counter, so I never send more than one request to the same partition. I also control the degree of parallelism by increasing/decreasing the number of threads. Each thread sends batches synchronously (no request overlapping).
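For reference, the write pattern described above looks roughly like this (a generic Java sketch of the scheme only; the real app is a C# console app using the CosmosDB.Table package, and the actual table call is left as a placeholder):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the load pattern described above: fixed-size batches,
// a fresh partition key per batch (global counter), and N worker
// threads, each sending its batches synchronously, one at a time.
public class TableWriteLoadSketch {

    static final int BATCH_SIZE = 100;                 // 100 entities per batch (~10 kB)
    static final AtomicLong partitionCounter = new AtomicLong();

    // Placeholder for the real storage call (a single table batch insert).
    static void sendBatchSynchronously(String partitionKey, List<String> entities) {
        // ... issue one batch request for this partition key and wait for the reply ...
    }

    static void workerLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            String partitionKey = "pk-" + partitionCounter.incrementAndGet();
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            for (int i = 0; i < BATCH_SIZE; i++) {
                batch.add("entity-" + i);              // ~0.1 kB hardcoded payload each
            }
            sendBatchSynchronously(partitionKey, batch);
        }
    }

    public static void main(String[] args) {
        int threads = 12;                              // degree of parallelism under test
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(TableWriteLoadSketch::workerLoop);
        }
        // Call pool.shutdownNow() once the measurement window has elapsed.
    }
}
```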
If I use 1 thread, I see 5 requests per second (5 batches, 500 entities). At that time the Azure portal metrics show table latency below 100 ms, which is quite good.
If I increase the number of threads up to 12, I see a 12x increase in outgoing requests. This rate stays stable for a few minutes. But then, for some reason, I start being throttled: I see latency increase and the request rate drop.
Below you can see the account metrics. The highlighted point shows 2.31K transactions (batches) per minute, which is about 3,850 entities per second. If threads are increased up to 50, latency increases up to 4 seconds and the transaction rate drops to 700 requests per second.
According to the documentation, I should be able to send up to 20K transactions per second within one account (my test account is used only for this performance test). 20K batches mean 200K entries. So the question is: why am I being throttled at around 3K entities per second?
Test details:
Azure Datacenter: West US 2.
My location: Los Angeles.
The app is written in C# and uses the CosmosDB.Table NuGet package with the following configuration: ServicePointManager.DefaultConnectionLimit = 250, Nagle's algorithm disabled.
The host machine is quite powerful, with a 1 Gb internet link (i7, 8 cores); no high CPU or memory usage is observed during the test.
PS: I've read the docs:
The system's ability to handle a sudden burst of traffic to a partition is limited by the scalability of a single partition server until the load balancing operation kicks-in and rebalances the partition key range.
and waited for 30 mins, but the situation didn't change.
EDIT
I got a comment that E2E latency doesn't reflect a server-side problem.
So below is a new graph showing not only the E2E latency but also the server latency. As you can see, they are almost identical, which makes me think the source of the problem is not on the client side.
We have Application Insights running on a web application deployed to a public interface on IIS. After approx. 1 hour of test usage (8-10 concurrent users), all requests enter a long-running state (1-2 minutes) before performance returns to normal. This pattern repeats at regular intervals during the day, generally in line with usage.
Removing App Insights "fixes" the problem; put Insights back in and the problem reoccurs.
What we know:
PerfMon counters: CPU, Memory, Network Interface, Disk I/O, % maxConcurrentCalls, % maxConcurrentInstances show no spikes during the bottleneck
A firewall blocks the outbound App Insights call and the request is left to time out
No IIS Events are raised (w3wp crash / app pool recycling) around the bottleneck time
WCF configuration is default
Given the frequency of the App Insights calls and the fact that they are blocked and left to time out, I would expect the number of threads to max out and cause the bottleneck, but I would not expect it to return to normal after 1-2 minutes, given that processing the queued requests would equally trigger App Insights calls.
What is the counter here that shows the bottleneck, and why does performance return to normal when it should stay choked?
I'm trying to stress-test my Spring RESTful Web Service.
I run my Tomcat server on an Intel Core 2 Duo notebook with 4 GB of RAM. I know it's not a real server machine, but it's all I have and this is only for study purposes.
For the test, I run JMeter on a remote machine; communication goes through a private WLAN with a central wireless router. I prefer to test over a wireless connection because the service would be accessed from mobile clients. With JMeter I run a group of 50 threads, starting one thread per second, so after 50 seconds all threads are running. Each thread repeatedly sends an HTTP request to the server containing a small JSON object to be processed, and on each iteration sleeps for an amount of time equal to the sum of a 100 ms constant delay and a random value from a Gaussian distribution with a standard deviation of 100 ms. I use some JMeter plugins for the graphs.
Here are the results:
I can't figure out why my hits per second don't pass the 100 threshold (in the graph they are multiplied by 10), because with this configuration they should be higher than that (50 threads sending at least three requests per second each would generate 150 hits/sec). I don't get any error messages from the server, and everything seems to work well. I've tried more and more configurations, but I can't get more than 100 hits/sec.
Why?
[EDIT] Many times I notice a substantial performance degradation from some point on without any visible cause: no error responses on the client, only OK HTTP responses, and everything seems to work well on the server too. But looking at the reports:
As you can see, something happens between 01:54 and 02:14: hits per second decrease and response time increases. OK, it could be server overload, but what about the CPU decreasing? That is not compatible with the congestion hypothesis.
I want to note that you've chosen very well which rows to display on the Composite Graph. It's enough to draw some conclusions:
Note that Hits Per Second correlates perfectly with CPU usage. This means you have a "CPU-bound" system, and maximum performance is limited mostly by CPU. This is very important to remember: server resources are spent by hits, not by active users. You could disable your sleep timers entirely and you would still get the same 80-90 hits/s.
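A rough back-of-the-envelope check of that claim (a sketch with assumed numbers: the 50 threads from your test plan, roughly 0.15 s average think time from the 100 ms constant delay plus the Gaussian component, and the ~90 hits/s ceiling read off the graph):

```java
// Back out the implied average response time from the observed hits/s ceiling
// using the closed-system relation: threads = hits/s * (responseTime + thinkTime).
public class ThroughputCeilingCheck {
    public static void main(String[] args) {
        int threads = 50;                  // from the test plan
        double observedHitsPerSec = 90.0;  // approximate ceiling from the composite graph
        double avgThinkTimeSec = 0.15;     // assumption: 100 ms constant + Gaussian part

        double impliedResponseSec = (double) threads / observedHitsPerSec - avgThinkTimeSec;
        System.out.printf("Implied average response time: ~%.2f s%n", impliedResponseSec);

        // If the server could answer in, say, 100 ms, the same 50 threads
        // would offer far more than 90 hits/s:
        double fastResponseSec = 0.1;      // hypothetical
        double offeredHitsPerSec = threads / (fastResponseSec + avgThinkTimeSec);
        System.out.printf("Offered load at 100 ms responses: ~%.0f hits/s%n", offeredHitsPerSec);
    }
}
```

In other words, under load the server appears to stretch response times to roughly 0.4 s, so 50 threads can only offer about 90 hits/s; faster responses would immediately raise the hit rate, which is consistent with the CPU ceiling described above.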
The maximum CPU level is somewhere around 80%, so I assume you are running Windows (Win7?) on your machine. In my experience it is hard to reach 100% CPU utilization on a Windows machine; it just doesn't seem to spend the last 20%. If you have reached that maximum, then you are seeing your installation's capacity limit: it simply doesn't have enough CPU to serve more requests. To fight this bottleneck you should either add CPU (use another server with better CPU hardware), configure the OS to let you use up to 100% (I don't know whether that is possible), or optimize your system (code, OS settings) to spend less CPU per request.
For the second graph I'd suppose something is being downloaded via the router, or something is happening on the JMeter machine. "Something is happening" means some task is running; this may be a friend who just wanted to run "grep error.log", or a scheduled task. To pin this down you should look at the router's resources and the JMeter machine's resources during the degradation. There must be a process that is swallowing CPU/disk/network.