We still need the old WebSphere based Kudos Boards from ISW. Users reported that ajax operations like setting due dates or responsible people in a card are very slow, some even abort before finished. Until now I can't see a pattern here why this occurs. It doesn't seem related to a specific user or client.
How could this be troubleshooted? I'd like to take one step back and measure first, how long it takes to process requests. Then generating a list of all requests that took more than a specific time to process, lets say more than one second. So we know how many users are affected from this and there are maybe some shared things between them, like users on a slow network or something like that.
Collecting request duration logs from IHS
The Apache module mod_log_config allows certain properties to be logged. Interesting is %T:
The time taken to serve the request, in seconds.
Using the %{UNIT}T format, it could be displayed in microseconds instead of seconds, which is better for analyzing later. So I extended the common log format to write the processing time first in conf/httpd.conf:
LogFormat "%{ms}T %h %l %u %t \"%r\" %>s %b" common
Another good idea is to add the User-Agent too, which gave some information about the type of client:
LogFormat "%{ms}T %h %l %u %t \"%r\" %>s %b \"%{User-Agent}i\"" common
First check the configuration file to make sure we didn't break anything:
/opt/IBM/HTTPServer #./bin/apachectl -t
Syntax OK
Now we can do a graceful reastart on productive environments. This way, the webserver finishes all currently processed requests before applying our new config. In opposite to a hard restart, most of the users shouldn't even know that we restarted their webserver.
./bin/apachectl -k graceful
All new requests are logged with the duration in ms like this:
100007 1.2.3.4 - - [03/Nov/2020:14:29:52 +0100] "POST /push/form/comet/connect HTTP/1.1" 200 96 "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0"
Analyzing the measured data
We now have several hundred MB of logs from a single day. A filter is required to get usefull information from them. When looking in the logs, we need to do two things to get information about slow requests from them:
Filter on request duration: Requests which are faster than XXX ms need to be sorted out
HCL Connections and Kudos Board use some very old workarounds: They keep ajax requests open to get informed about server updates. A old way of websockets. Those requests just spam our slowlog with false-positives.
I wrote the following command for this purpose:
log=/tmp/access-slow.log; egrep -i -v "(POST /(push|files)/|GET /(kudosboards/updates|api-boards)/|\.mp4(\?[A-Za-z0-9_\-=&]+)? HTTP)" logs/access_2020_11_03.log | egrep '^[0-9]{4,}' | sort -rnk1 > $log; wc -l $log
The first egrep call filters out all requests which are not interesting here: /push is from the long keep alive connections, kudos do this as well. I also excluded mp4 movies because those files are naturally large. When downloading a 500MB movie take some time, it's nothing we need to worry about. Sometimes I also saw them with GET parameters like movie.mp4?logDownload=true&downloadType=view so those were filtered as well.
egrep '^[0-9]{4,}' sort out requests which have a processing time less than 4 digits, wich basically mean >= 1000ms. Since I'm looking for heavy performance issues, this seems a good point to start.
sort -rnk orders our logs by duration descending, so we have the slowest entries at the beginning.
wc -l $log is just to know how much slow entries we collected this time. Now they could be viewed with less $log. Our large access log with over 2 million entries was reduced to 2k.
Related
We are currently conducting performance tests on both web apps that we have, one is running within a private network and the other is accessible for all. For both apps, a single page-load of the landing page or initial page only takes between 2-3 seconds on a user POV, but when we use blaze and JMeter, the results are between 15-20 seconds. Am I missing something? The 15-20 seconds result came from the Loadtime/Sample Time in JMeter and in Elapsed column if extracted to .csv. Please help as I'm stuck.
We have tried conducting tests on multiple PCs within the office premises along with a PC remotely accessed on another site and we still get the same results. The number of thread and ramp-up period is both set to 1 to imitate a single user only.
Where a delta exists, it is certain to mean that two different items are being timed. It would help to understand on your front end are you timing to a standard metric, such as w3c domComplete, time to interactive, first contentful paint, some other location, and then compare where this comes into play on the drilldown on the performance tab of chrome. Odds are that there is a lot occuring that is not visible that is being captured by Jmeter.
You might also look for other threads on here on how jmeter operates as compared to a "real browser" There are differences which could come into play affecting your page comparisons, particularly if you have dozens/hundreds of elements that need to be downloaded to complete your page. Also, pay attention to third party components where you do not have permission to test their servers.
I can think of 2 possible causees:
Clear your browser history, especially browser cache. It might be the case you're getting HTTP Status 304 for all requests in browser because responses are being returned from the browser cache and no actual requests are being made while JMeter always uses "clean" session.
Pay attention to Connect Time and Latency metrics as it might be the case the server response time is low but the time for network packets to travel back and forth is very high.
Connect Time. JMeter measures the time it took to establish the connection, including SSL handshake. Note that connect time is not automatically subtracted from latency. In case of connection error, the metric will be equal to the time it took to face the error, for example in case of Timeout, it should be equal to connection timeout.
Latency. JMeter measures the latency from just before sending the request to just after the first response has been received. Thus the time includes all the processing needed to assemble the request as well as assembling the first part of the response, which in general will be longer than one byte. Protocol analysers (such as Wireshark) measure the time when bytes are actually sent/received over the interface. The JMeter time should be closer to that which is experienced by a browser or other application client.
So basically "Elapsed time = Connect Time + Latency + Server Processing Time"
In general given:
the same machine
clean browser session
and JMeter configured to behave like a real browser
you should get similar or equal timings for the same page
I'm trying to stress test my Spring Boot application, but when I run the following command, what ab is doing is that trying to give out a result the the maximum my application could holds. But what I need is to check whether my application could hold at a specific request per second.
ab -p req.json -T application/json -k -c 1000 -n 500000 http://myapp.com/customerTrack/v1/send
The request per second given from above command is 4000, but actually, a lot of records are buffered in my application which means it can't hold that much rps. Could anyone tell me how to set a specific request per second in ab tools? Thanks!
I don't think you can get what you want from ab. There are a lot of other tools out there.
Here's a simple one that might do exactly what you want.
https://github.com/rakyll/hey
For rate limiting to 100 requests per second the below command should work.
hey -D req.json -T application/json -c 1000 -q 100 -n 500000 http://myapp.com/customerTrack/v1/send
Apache Bench is single threaded program that can only take advantage of one processor on your client’s machine. In extreme conditions, the tool could misrepresent results if the parameters of your test exceed the capabilities of the environment the tool is running in. Accorading to your description, the rps has already reach your hardware limitation.
A lot of records are buffered in my application which means it can't hold that much rps
It is very hard to control request per second in single machine.
You can find better performacne testing tools from here HTTP(S) Benchmark Tools
If you have budget you can try goad, which is an AWS Lambda powered, highly distributed, load testing tool built in Go for the 2016 Gopher Gala. Goad allows you to load test your websites from all over the world whilst costing you the tiniest fractions of a penny by using AWS Lambda in multiple regions simultaneously.
We are seeing inconsistent performance on Heroku that is unrelated to the recent unicorn/intelligent routing issue.
This is an example of a request which normally takes ~150ms (and 19 out of 20 times that is how long it takes). You can see that on this request it took about 4 seconds, or between 1 and 2 orders of magnitude longer.
Some things to note:
the database was not the bottleneck, and it spent only 25ms doing db queries
we have more than sufficient dynos, so I don't think this was the bottleneck (20 double dynos running unicorn with 5 workers each, we get only 1000 requests per minute, avg response time of 150ms, which means we should be able to serve (60 / 0.150) * 20 * 5 = 40,000 requests per minute. In other words we had 40x the capacity on dynos when this measurement was taken.
So I'm wondering what could cause these occasional slow requests. As I mentioned, anecdotally it seems to happen in about 1 in 20 requests. The only thing I can think of is there is a noisy neighbor problem on the boxes, or the routing layer has inconsistent performance. If anyone has additional info or ideas I would be curious. Thank you.
I have been chasing a similar problem myself, with not much luck so far.
I suppose the first order of business would to be to recommend NewRelic. It may have some more info for you on these cases.
Second, I suggest you look at queue times: how long your request was queued. Look at NewRelic for this, or do it yourself with the "start time" HTTP header that Heroku adds to your incoming request (just print now() minus "start time" as your queue time).
When those failed me in my case, I tried coming up with things that could go wrong, and here's a (unorthodox? weird?) list:
1) DNS -- are you making any DNS calls in your view? These can take a while. Even DNS requests for resolving DB host names, Redis host names, external service providers, etc.
2) Log performance -- Heroku collects all your stdout using their "Logplex", which it then drains to your own defined logdrains, services such as Papertrail, etc. There is no documentation on the performance of this, and writes to stdout from your process could block, theoretically, for periods while Heroku is flushing any buffers it might have there.
3) Getting a DB connection -- not sure which framework you are using, but maybe you have a connection pool that you are getting DB connections from, and that took time? It won't show up as query time, it'll be blocking time for your process.
4) Dyno performance -- Heroku has an add-on feature that will print, every few seconds, some server metrics (load avg, memory) to stdout. I used Graphite to graph those and look for correlation between the metrics and times where I saw increased instances of "sporadic slow requests". It didn't help me, but might help you :)
Do let us know what you come up with.
I am using to test my web server https://buyandbrag.in .
I have tested it for 100 users. But the main server is not showing like it is crowded or not.
I want to know whether it is really pressuring the main server(a cloud server I am using).Or just use the client resourse where the tool is installed.
Yes as mentioned you should be monitoring both servers to see how they handle the load. The simplest way to do this is with TOP (if your server OS is *NIX) also you should be watching the network activity i.e. Bandwidth, connection status (time wait, close wait and so on).
Also if your using apache keep an eye on the logs you should see the requests being logged there
Good luck with the tests
I want to know "how many users my website can handele ?",when I tested with 50 threads ,the cpu usage of my server increased but not the connections log(It showed just 2 connections).also the bandwidth usage is not that much
Firstly what connections are you referring to? Apache, DB etc?
Secondly if you want to see how many users your current setup can hand you need to create a profile or traffic model of what an average user will do on your site.
For example:
Say 90% of the time they will search for something
5% of the time they will purchase x
5% of the time they login.
Once you have your "Traffic Model" defined, implement it in jMeter then start increasing your load in increments i.e. running your load test for 10mins with x users, after 10mins increment that number and so on until you find your breaking point.
If you graph your responses you should see two main things:
1) The optimum response time / number of users before the service degrades
2) The tipping point i.e. at what point you start returning 503's etc
Now you'll have enough data to scale your site or to start making performance improvements from a code point of view.
I have the exact same html sitting on two different servers. Both pages call things like stylesheets and images from the same servers (not each from their local server). In other words, these pages are identical except they exist on two different servers. It's all static html. The only DNS lookups are for images.
On one server it takes 25 seconds to load, and it appears most of that is waiting on the html page itself
http://tools.pingdom.com/fpt/#!/CmGSycTZd/http://205.158.110.184/contents/mylayout/2
On another server it takes under 2 seconds to load
http://tools.pingdom.com/fpt/#!/rqg73fi7V/http://socialmediaphyte.com/TEST/image-dns-testing-ImageON.html
The only difference I can ID from Pingdom is "Connection." The slow server responds with "close" and the fast server responds with "Keep-Alive". Is THAT the most likely issue? Or is it possibly something else? (And if you know the remedy for your suspected cause, that would be wonderful.)
Thanks!
Not using keep-alive will slow the overall load time a bit because you incur the additional overhead of having to establish a new connection for each resource, rather than re-using one or more connections. This shouldn't equate to 23 seconds difference though.
Using the FireBug Net Panel for Firefox can be of great assistance in seeing what is taking so long. It shows you how long each requested resource from the page took to load, and how long each phase of requesting the resource took.
Other factors could include one server is using gzip compression on pages and the other is not, or it could just be overloaded.