Play2 performance surprises

Some introduction first.
Our Play2 (ver. 2.5.10) web service provides responses in JSON, and the response size can be relatively large: up to 350KB.
We've been running our web service in standalone mode for a while. We had gzip compression enabled in Play2, which reduces the response body roughly 10 times, i.e. we had response bodies of up to ~35KB.
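For context, gzip in Play 2.5 is typically enabled via the built-in GzipFilter, along the lines of this minimal sketch (illustrative, not our exact wiring):

import javax.inject.Inject
import play.api.http.HttpFilters
import play.filters.gzip.GzipFilter

// Registers Play's built-in gzip filter for all responses.
class Filters @Inject() (gzipFilter: GzipFilter) extends HttpFilters {
  def filters = Seq(gzipFilter)
}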
In that mode the service can handle up to 200 queries per second while running on an AWS EC2 m4.xlarge (4 vCPU, 16GB RAM, 750 Mbit network) inside a Docker container. The performance is completely CPU-bound: most of the time (75%) is spent on JSON serialization, and the rest (25%) on gzip compression. The business logic is very fast and is not even visible on the perf graphs.
Now we have introduced a separate front-end node (running Nginx) to handle some specific functions: authentication, authorization and, crucially for this question, traffic compression. We hoped to offload compression from the Play2 back-end to the front-end and spend those 25% of CPU cycles on the main tasks.
However, instead of improving, performance got much worse! Now our web service can only handle up to 80 QPS. At that point, most of the CPU is already consumed by something inside the JVM. Our metrics show it's not garbage collection, and it's also not in our code, but rather something inside Play.
It's important to note that at this load (80 QPS at ~350KB per response) we generate ~30 MB/s of traffic. This number, while significant, doesn't saturate the EC2 network, so that shouldn't be the bottleneck.
So my question, I guess, is as follows: does anyone have an explanation and a mitigation plan for this problem? Some hints on how to get to the root cause would also be helpful.

Related

What methodology would you use to measure the load capacity of a software server application?

I have a high-performance software server application that is expected to get increased traffic in the next few months.
I was wondering what approach or methodology is good to use in order to gauge if the server still has the capacity to handle this increased load?
I think you're looking for Stress Testing and the scenario would be something like:
Create a load test simulating current real application usage
Start with the current number of users and gradually increase the load until:
you reach the "increased traffic" amount,
or errors start occurring,
or you start observing performance degradation,
whichever comes first.
Depending on the outcome, you can either state that your server can handle the increased load without any issues, or you will have found the saturation point and the first bottleneck.
You might also want to execute a Soak Test: leave the system under prolonged high load for several hours or days; this way you can detect memory leaks or other capacity problems.
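If you're on the JVM, one way to express both of those scenarios is a Gatling simulation; a minimal sketch using Gatling 3 syntax (the target URL, rates and durations are placeholders):

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class CapacitySimulation extends Simulation {
  val httpProtocol = http.baseUrl("http://your-server:9000") // placeholder target

  val stress = scenario("stress ramp").exec(http("home").get("/"))
  val soak   = scenario("soak").exec(http("home").get("/"))

  setUp(
    // Stress: ramp from today's load towards the expected "increased traffic".
    stress.inject(rampUsersPerSec(10) to 200 during (30.minutes)),
    // Soak: after the ramp, hold a realistic load for hours to surface leaks.
    soak.inject(nothingFor(30.minutes), constantUsersPerSec(50) during (8.hours))
  ).protocols(httpProtocol)
}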
More information: Why ‘Normal’ Load Testing Isn’t Enough
Test the product with one-tenth the data and traffic. Be sure the activity is 'realistic'.
Then consider what will happen as traffic grows -- will RAM, disk, CPU, network, etc. usage grow linearly or not?
While you are doing that, look for "hot spots". Optimize them.
Will you be using web pages? Databases? Etc. Each of these things scales differently. (In other words, you have not provided enough details in your question.)
Most canned benchmarks focus on one small aspect of computing; applying the results to a specific application is iffy.
I would start by collecting base line data on critical resources - typically, CPU, memory usage, disk usage, network usage - and track them over time. If any of those resources show regular spikes where they remain at 100% capacity for more than a fraction of a second, under current usage, you have a bottleneck somewhere. In this case, you cannot accept additional load without likely outages.
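For the CPU and memory pieces, even a tiny in-process sampler can build that baseline; here is a sketch assuming the HotSpot-specific com.sun.management.OperatingSystemMXBean is available (disk and network usage are better read from OS tools such as iostat and sar):

import java.lang.management.ManagementFactory
import com.sun.management.OperatingSystemMXBean

object BaselineSampler extends App {
  val os = ManagementFactory.getOperatingSystemMXBean
    .asInstanceOf[OperatingSystemMXBean]
  val rt = Runtime.getRuntime

  while (true) {
    val cpuPct = os.getSystemCpuLoad * 100                     // whole-system CPU, percent
    val heapMb = (rt.totalMemory - rt.freeMemory) / (1 << 20)  // JVM heap in use, MB
    println(f"cpu=$cpuPct%.1f%% heapUsedMb=$heapMb")
    Thread.sleep(5000)                                         // sample every 5 seconds
  }
}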
Next, I'd start figuring out what the bottleneck resource for your application is - it varies between applications, but in most cases it's the bottleneck resource that stops you from scaling further. Your CPU might be almost idle, but you're thrashing the disk I/O, for instance. That's a tricky process - load and stress testing are the way to go.
If you can resolve the bottleneck by buying better hardware, do so - it's much cheaper than rewriting the software. If you can't buy better hardware, look at load balancing. If you can't load balance, you've got to look at application architecture and implementation and see if there are ways to move the bottleneck.
It's quite typical for the bottleneck to move from one resource to the next - you've got CPU to behave, but now when you increase traffic, you're spiking disk I/O; once you resolve that, you may get another CPU challenge.

Can bloated CacheStorage slow down Service Worker responses?

I am caching opaque responses in a Service Worker, which bloats CacheStorage to the order of 6 GB+. At times I see SW responses that are slower than those from the browser cache.
Can a bloated CacheStorage lead to slow reads and hence degraded performance while serving requests through the SW? How much of this performance is driven by the kind of drive the machine has -- SSD vs. HDD?
PS: I am aware that the ideal fix is to either fix the opaque responses or not cache them at all.
The Cache Storage API gives you programmatic control over cache expiration, and allows you to use cached responses within JavaScript to construct complex serving/fallback strategies that work independent of the network, which is generally speaking not possible when just using the browser's HTTP cache.
There is no specific expectation that using the Cache Storage API is going to be faster than the HTTP cache, though.
There's some degree of overhead involved whenever service worker code is run, and the details about that overhead's impact are likely to vary based on storage medium, CPU, browser version, and any number of other criteria. I will say that I don't believe the fact that these are opaque responses comes that heavily into play when it comes to runtime performance, and the additional quota usage is effectively "fictitious". It just translates into a higher number used when calculating available quota, but does not actually result in more data being written to disk.

Scrapy spiders drastically slow down while running on AWS EC2

I am using Scrapy to scrape multiple sites and Scrapyd to run the spiders.
I have written 7 spiders, and each spider processes at least 50 start URLs. I have around 7,000 URLs, 1,000 URLs per spider.
As I start placing jobs in Scrapyd with 50 start URLs per job, all spiders initially respond fine, but then they suddenly start working really slowly. When I run Scrapyd on localhost it gives very high performance; as soon as I publish jobs to the Scrapyd server, the request response time drastically increases, and after some time the response time for each start URL is really slow.
The settings look like this:
BOT_NAME = 'service_scraper'
SPIDER_MODULES = ['service_scraper.spiders']
NEWSPIDER_MODULE = 'service_scraper.spiders'
# Global cap; the effective per-domain concurrency is the smaller of this
# and CONCURRENT_REQUESTS_PER_DOMAIN, so the 1000 below has no effect beyond 30.
CONCURRENT_REQUESTS = 30
# DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS_PER_DOMAIN = 1000
ITEM_PIPELINES = {
    'service_scraper.pipelines.MongoInsert': 300,
}
MONGO_URL = "mongodb://xxxxx:yyyy"
# Disable the feed exporter extension.
EXTENSIONS = {'scrapy.contrib.feedexport.FeedExporter': None}
# Caches every response on disk -- disk-I/O heavy on EBS-backed instances.
HTTPCACHE_ENABLED = True
We tried changing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN, but nothing worked. We have hosted Scrapyd on AWS EC2.
As with all performance testing, the goal is to find the performance bottleneck. This typically falls to one (or more) of:
Memory: Use top to measure memory consumption. If too much memory is consumed, it might swap to disk, which is slower than RAM. Try adding memory.
CPU: Use Amazon CloudWatch to track CPU. Be very careful with t2 instances (see below).
Disk speed: If the job is disk-intensive, or if memory is swapping to disk, this can impact performance -- especially for databases. Amazon EBS is network-attached disk, so network speed can actually throttle disk speed.
Network speed: Due to the multi-tenant design of Amazon EC2, network bandwidth is intentionally throttled. The amount of network bandwidth available depends upon the instance type used.
You are using a t2.small instance. It has:
Memory: 2GB (This is less than the 4GB on your own laptop)
CPU: The t2 family is extremely powerful, but the t2.small only receives an average 20% of CPU (see below).
Network: The t2.small is rated as Low to Moderate network bandwidth.
The fact that your CPU is recording 60% while the t2.small is limited to an average of 20% indicates that the instance is consuming CPU credits faster than it is earning them. This leads to eventual exhaustion of CPU credits, thereby limiting the machine to 20% of CPU. This is highly likely to be impacting your performance. You can view CPU Credit balances in Amazon CloudWatch.
See: T2 Instances documentation for an understanding of CPU Credits.
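To put rough numbers on it (using the published t2.small figures: a 20% baseline means 12 credits earned per hour, with a maximum balance of 288): running at 60% CPU burns 36 credits per hour, a net drain of 24 per hour, so even a full balance is exhausted in about 12 hours, after which the instance is pinned to the 20% baseline.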
Network bandwidth is relatively low for the t2.small. This impacts Internet access and communication with the Amazon EBS storage volume. Given that your application is downloading lots of web pages in parallel, and then writing them to disk, this is also a potential bottleneck for your system.
Bottom line: When comparing to the performance on your laptop, the instance in use has less memory, potentially less CPU due to exhaustion of CPU credits and potentially slower disk access due to high network traffic.
I recommend you use a larger instance type to confirm that performance is improved, then experiment with different instance types (both in the t2 family and outside of it) to determine what size machine gives you the best price/performance trade-off.
Continue to monitor the CPU, Memory and Network performance to identify the leading bottleneck, then aim to fix that bottleneck.

Low throughput of spray application

I am working on a spray application that is expected to work as part of a real-time bidder, so both latency and throughput are very important (~80 ms maximal latency; N*10000 rps required). But right now I am seeing very low throughput (less than 1000 rps) on an AWS machine of c4.2xlarge type (8 CPU cores, 16GB memory, ulimit 500000). As far as I can see from JVM and system monitoring, it is far from being limited by CPU, memory, disk IO or networking.
The system contains mainly two pools of actors on the request critical path, and I don't see throughput increase when I enlarge the pools (mostly I see increased latency instead). So I think the problem may be context switching, and I'd be glad to see advice on actor dispatcher configuration.
Some more details about the application architecture.
The system contains two sets of actors on the request critical path: a pool of ServerActor (extends spray HttpService) and a pool of BidderAgentActor. ServerActor just sends a message with the ask pattern (the route is completed with its future) to BidderAgentActor. BidderAgentActor handles the request mostly synchronously. It uses Futures to do work concurrently when possible, but waits for the result of the final future. No external calls are made in this process (no DB queries, no HTTP calls...). There are several more actors in the application, to which BidderAgentActor sends messages for offline processing.
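A stripped-down sketch of that flow (illustrative names only; the real code is in the gist linked below), with a comment on the blocking step, since that is where I suspect the context switching comes from:

import akka.actor.Actor
import akka.pattern.pipe
import scala.concurrent.Future

case class BidRequest(payload: String)
case class BidResponse(body: String)

class BidderAgentActor extends Actor {
  import context.dispatcher

  def receive = {
    case BidRequest(payload) =>
      // Today the actor waits for the final future; a blocking wait
      // (Await.result) ties up a dispatcher thread per request, whereas
      // piping the result back keeps the dispatcher free.
      compute(payload).map(BidResponse(_)) pipeTo sender()
  }

  // Stand-in for the synchronous bidding logic (no external calls).
  private def compute(payload: String): Future[String] =
    Future(payload.toUpperCase)
}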
Updated
Here is a link to slightly simplified code, both for the BidderAgentActor pool and for a BidderAgentActor created per request: gist
Versions:
scala - 2.11.8
akka - 2.3.12
spray - 1.3.1
OS: Ubuntu 14.04

What are the theoretical performance limits on web servers?

In a currently deployed web server, what are the typical limits on its performance?
I believe a meaningful answer would be one of 100, 1,000, 10,000, 100,000 or 1,000,000 requests/second, but which is true today? Which was true 5 years ago? Which might we expect in 5 years? (ie, how do trends in bandwidth, disk performance, CPU performance, etc. impact the answer)
If it is material, the fact that HTTP over TCP is the access protocol should be considered. OS, server language, and filesystem effects should be assumed to be best-of-breed.
Assume that the disk contains many small unique files that are statically served. I intend to eliminate the effect of memory caches, and to assume that CPU time is mainly used to assemble the network/protocol information. These assumptions are intended to bias the answer towards 'worst case' estimates, where a request requires some bandwidth, some CPU time and a disk access.
I'm only looking for something accurate to an order of magnitude or so.
Read http://www.kegel.com/c10k.html. You might also read Stack Overflow questions tagged 'c10k'. C10K stands for 10,000 simultaneous clients.
Long story short: principally, the limit is neither bandwidth nor CPU. It's concurrency.
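The classic answer to C10K is event-driven IO: one thread multiplexing many connections instead of one thread per connection. A bare-bones sketch of the model with java.nio (an echo server rather than HTTP, just to show the shape):

import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.{SelectionKey, Selector, ServerSocketChannel, SocketChannel}

object SelectorEcho extends App {
  val selector = Selector.open()
  val server = ServerSocketChannel.open()
  server.bind(new InetSocketAddress(8080))
  server.configureBlocking(false)
  server.register(selector, SelectionKey.OP_ACCEPT)

  val buf = ByteBuffer.allocate(4096)
  while (true) {
    selector.select() // blocks until any registered channel is ready
    val keys = selector.selectedKeys().iterator()
    while (keys.hasNext) {
      val key = keys.next(); keys.remove()
      if (key.isAcceptable) {
        val client = server.accept()
        client.configureBlocking(false)
        client.register(selector, SelectionKey.OP_READ)
      } else if (key.isReadable) {
        val ch = key.channel().asInstanceOf[SocketChannel]
        buf.clear()
        if (ch.read(buf) < 0) ch.close() // peer closed the connection
        else { buf.flip(); ch.write(buf) } // echo back (may short-write; fine for a sketch)
      }
    }
  }
}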
Six years ago, I saw an 8-proc Windows Server 2003 box serve 100,000 requests per second for static content. That box had 8 Gigabit Ethernet cards, each on a separate subnet. The limiting factor there was network bandwidth. There's no way you could serve that much content over the Internet, even with a truly enormous pipe.
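The arithmetic bears that out: 8 x 1 Gbit/s is roughly 1 GB/s of egress, which at 100,000 requests per second leaves only about 10 KB per response -- the NICs, not the CPUs, set the ceiling.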
In practice, for purely static content, even a modest box can saturate a network connection.
For dynamic content, there's no easy answer. It could be CPU utilization, disk I/O, backend database latency, not enough worker threads, too much context switching, ...
You have to measure your application to find out where your bottlenecks lie. It might be in the framework, it might be in your application logic. It probably changes as your workload changes.
I think it really depends on what you are serving.
If you're serving web applications that dynamically render html, CPU is what is consumed most.
If you are serving up a relatively small number of static items lots and lots of times, you'll probably run into bandwidth issues first (since the static files will probably end up cached in memory).
If you're serving up a large number of static items, you may run into disk limits first (seeking and reading files)
If you are not able to cache your files in memory, then disk seek times will likely be the limiting factor and limit your performance to less than 1000 requests/second. This might improve when using solid state disks.
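As a rough sanity check: at ~10 ms per random seek, a single spindle sustains on the order of 100 seeks per second, so even a small array of disks stays well under 1,000 seek-bound requests per second -- which is where that figure comes from.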
100, 1,000, 10,000, 100,000 or 1,000,000 requests/second, but which is true today?
This test was done on a modest i3 laptop, but it reviewed Varnish, ATS (Apache Traffic Server), Nginx, Lighttpd, etc.
http://nbonvin.wordpress.com/2011/03/24/serving-small-static-files-which-server-to-use/
The interesting point is that using a high-end 8-core server gives very little boost to most of them (Apache, Cherokee, Litespeed, Lighttpd, Nginx, G-WAN):
http://www.rootusers.com/web-server-performance-benchmark/
As the tests were done on localhost to avoid hitting the network as a bottleneck, the problem lies in the kernel, which does not scale -- unless you tune its options.
So, to answer your question: the room for progress is in how servers process IO.
They will have to use better (wait-free) data structures.
I think there are too many variables here to answer your question.
What processor, what speed, what cache, what chipset, what disk interface, what spindle speed, what network card, how configured, the list is huge. I think you need to approach the problem from the other side...
"This is what I want to do and achieve, what do I need to do it?"
OS, server language, and filesystem effects are the variables here. If you take them out, then you're left with a no-overhead TCP socket.
At that point it's not really a question of performance of the server, but of the network. With a no-overhead TCP socket your limit that you will hit will most likely be at the firewall or your network switches with how many connections can be handled concurrently.
In any web application that uses a database you also open up a whole new range of optimisation needs:
indexes, query optimisation, etc.
For static files, does your application cache them in memory?
etc, etc, etc
This will depend on:
what your CPU is,
what speed your disks are,
how 'fat' a medium-sized hosting company's pipe is,
what the web server is.
The question is too general.
Deploy your server, test it using tools like http://jmeter.apache.org/, and see how you get on.
