Some introduction first.
Our Play2 (ver. 2.5.10) web service provides responses in JSON, and the response size can be relatively large: up to 350KB.
We've been running our web service in standalone mode for a while. We had gzip compression enabled in Play itself, which reduces the response body roughly 10×, i.e. we had response bodies of up to ~35KB.
In that mode the service can handle up to 200 queries per second while running on an AWS EC2 m4.xlarge (4 vCPU, 16GB RAM, 750 Mbit network) inside a Docker container. The performance is completely CPU-bound: most of the time (75%) is spent on JSON serialization, and the rest (25%) on gzip compression. The business logic is very fast and is not even visible on the perf graphs.
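For context, that split can be roughly reproduced by timing the two steps in isolation. This is only a minimal sketch with a synthetic payload, using plain Jackson as a stand-in for Play's JSON layer (so not our actual service code); the point is just to separate the cost of producing the JSON bytes from the cost of compressing them:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.ByteArrayOutputStream;
import java.util.*;
import java.util.zip.GZIPOutputStream;

public class SerializeVsGzipBench {
    public static void main(String[] args) throws Exception {
        // Build a synthetic payload that serializes to roughly 350 KB of JSON.
        List<Map<String, Object>> payload = new ArrayList<>();
        for (int i = 0; i < 3500; i++) {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("id", i);
            row.put("name", "item-" + i);
            row.put("description", "some moderately long description text, repeated to pad out the payload");
            payload.add(row);
        }
        ObjectMapper mapper = new ObjectMapper();

        // Warm up the JIT before timing.
        for (int i = 0; i < 50; i++) {
            gzip(mapper.writeValueAsBytes(payload));
        }

        long t0 = System.nanoTime();
        byte[] json = mapper.writeValueAsBytes(payload);   // serialization cost
        long t1 = System.nanoTime();
        byte[] gz = gzip(json);                            // compression cost
        long t2 = System.nanoTime();

        System.out.printf("JSON: %d KB in %.2f ms, gzip: %d KB in %.2f ms%n",
                json.length / 1024, (t1 - t0) / 1e6,
                gz.length / 1024, (t2 - t1) / 1e6);
    }

    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }
}
```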
Now we have introduced a separate front-end node (running Nginx) to handle some specific functions: authentication, authorization and, crucially to this question, traffic compression. We hoped to offload compression from the Play2 back-end to the front-end and spend those 25% of CPU cycles on main tasks.
However, instead of improving, performance got much worse! Now our web service can only handle up to 80 QPS. At that point, most of the CPU is already consumed by something inside the JVM. Our metrics show it's not garbage collection, and it's also not in our code, but rather something inside Play.
It's important to note that at this load (80 QPS at ~350KB per response) we generate ~30 MB/s of traffic. This number, while significant, doesn't saturate the EC2 networking, so that shouldn't be the bottleneck.
So my question, I guess, is as follows: does anyone have an explanation and a mitigation plan for this problem? Some hints about how to get to the root cause of this would also be helpful.
I would like to deploy SSL on an MQ server, but I would like to know if my current CPU capacity will support SSL. (I do not have the budget to increase the number of CPU cores or MQ PVUs.)
My specs:
Windows Server 2003 SP2,
1 core of Intel Xeon CPU E5-2690 2.9GHz,
2 GB RAM,
1 Qmgr,
Linear Logging,
Persistent messages,
DQM with 5 partners,
10 sender channels,
10 receiver channels
Over a month:
we exchange on average 3 million messages with our partners, totalling about 15 GB of data
(so about 5 KB per message);
CPU usage typically varied between 20% and 40%;
we saw 4 peaks of 100% CPU.
Do you think my system can cope with SSL using the RC4_MD5_EXPORT CipherSpec?
Best Regards,
Pascal
I don't think it's possible to provide a definitive answer as to whether your server can cope with enabling SSL using the RC4_MD5_EXPORT cipher on your MQ channels short of trying it and assessing the impact. You may also want to take a look at the processor queue length using the Windows Performance Monitor tool to see how many processes are waiting for CPU time when the usage increases.
As your CPU provides hardware support for the AES encryption algorithm you may want to consider using one of the AES CipherSpecs instead. This also has the advantage of providing better security as both MD5 and RC4 are fairly weak in terms of hashing and encryption.
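As a rough sense of scale: 15 GB per month averages out to only about 6 KB/s, so the bulk encryption cost is likely to be small next to the handshake cost and the 100% CPU peaks you already see. If you want a feel for raw AES speed on that box before committing, a small stand-alone benchmark is one option. This is only a sketch in Java (MQ does its CipherSpec work in its own native crypto libraries, so treat the number as a rough proxy for the hardware, not a prediction of MQ behaviour):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class AesThroughput {
    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];          // 128-bit key; contents are irrelevant for a speed test
        byte[] iv = new byte[16];
        byte[] block = new byte[5 * 1024];  // ~5 KB, matching the average message size

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));

        int iterations = 100_000;           // ~500 MB total
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            cipher.update(block);           // encrypt without finalizing, to measure bulk throughput
        }
        cipher.doFinal();
        double seconds = (System.nanoTime() - start) / 1e9;
        double mbPerSec = (iterations * (double) block.length / 1e6) / seconds;
        System.out.printf("AES-128-CBC: ~%.0f MB/s on one core%n", mbPerSec);
    }
}
```

If the reported throughput is in the tens of MB/s or better, encrypting ~6 KB/s of message traffic is noise; the existing CPU peaks are the thing to watch.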
One option to consider is installing an SSL acceleration card in your server to allow the messages to be encrypted/hashed using dedicated hardware rather than your server's CPU. This page on the IBM Knowledge Center http://www-01.ibm.com/support/knowledgecenter/#!/SSFKSJ_7.5.0/com.ibm.mq.ref.doc/q049300_.htm provides some further information and lists which cards are supported by WebSphere MQ.
What is the browser's overhead to decompress a gzip server response of an average sized web page?
Less than 1 ms? 1-3 ms? More?
I'll assume that you mean 1.3M uncompressed. I get about 6 ms decompression time on one core of a 2 GHz i7.
If I assume 1/3 compression, an extra 7 Mbits needs to be transferred if not compressed. That will take more than 6 ms on a 1 Gbit/s link. 700 ms on a more typical 10 Mbit/s link.
gzip is a big win for HTTP transfers.
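If you'd rather measure this on your own hardware than take my word for it, the following sketch times a single decompression of a ~1.3 MB synthetic page through Java's java.util.zip (which is backed by zlib); browsers call native zlib directly, so they should be at least as fast, but the order of magnitude carries over:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GunzipTiming {
    public static void main(String[] args) throws Exception {
        // Roughly 1.3 MB of HTML-like, compressible text.
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 1_300_000) {
            sb.append("<div class=\"row\"><span>some repetitive page content</span></div>\n");
        }
        byte[] page = sb.toString().getBytes(StandardCharsets.UTF_8);

        // Compress once up front.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(page);
        }
        byte[] compressed = bos.toByteArray();

        // Warm up, then time a single decompression pass.
        for (int i = 0; i < 20; i++) gunzip(compressed);
        long t0 = System.nanoTime();
        byte[] out = gunzip(compressed);
        long t1 = System.nanoTime();

        System.out.printf("%d KB -> %d KB, decompressed in %.2f ms%n",
                compressed.length / 1024, out.length / 1024, (t1 - t0) / 1e6);
    }

    static byte[] gunzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }
}
```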
Using the zlib implementation of gzip with default parameters.
On an internet-facing server (Xeon CPU, 2.66 GHz, quad core), the gzip compression times are:
less than 0.5 ms for payloads up to 15 KB; 361 KB takes 4.50 ms and 1077 KB takes 13 ms.
I consider this still easily worth it, however, as most of our traffic is heading out over Wi-Fi or 3G links, so transfer time far outweighs server delay.
The times are measured with code bracketing only the calls to the gzip routines, using nanosecond-precision timers; I changed the source to implement this. I was measuring this anyway, as I was trying to determine whether caching gzipped output was worth the memory trade-off, or whether gzip was fast enough on its own. In our case, I think we will gzip everything above about 200 bytes, and aggressively cache gzip'd responses, especially for larger packets.
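The caching idea can be as simple as a map from response key to pre-compressed bytes. A minimal sketch (in Java rather than our actual server code; the class and method names are made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.GZIPOutputStream;

/** Hypothetical cache of pre-compressed response bodies, keyed by URL or content hash. */
public class GzipResponseCache {
    private static final int MIN_SIZE = 200;  // don't bother compressing tiny payloads
    private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();

    /** Returns the gzip'd body for a key, compressing and caching it on first use. */
    public byte[] gzipped(String key, String body) {
        byte[] raw = body.getBytes(StandardCharsets.UTF_8);
        if (raw.length < MIN_SIZE) {
            return raw;  // caller should then skip the Content-Encoding: gzip header
        }
        return cache.computeIfAbsent(key, k -> compress(raw));
    }

    private static byte[] compress(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

Anything under the ~200-byte threshold is returned uncompressed, since gzip overhead can actually grow tiny payloads.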
(@Mark Adler, thanks for writing zlib)
In a currently deployed web server, what are the typical limits on its performance?
I believe a meaningful answer would be one of 100, 1,000, 10,000, 100,000 or 1,000,000 requests/second, but which is true today? Which was true 5 years ago? Which might we expect in 5 years? (ie, how do trends in bandwidth, disk performance, CPU performance, etc. impact the answer)
If it is material, the fact that HTTP over TCP is the access protocol should be considered. OS, server language, and filesystem effects should be assumed to be best-of-breed.
Assume that the disk contains many small unique files that are statically served. I intend this to eliminate the effect of memory caches, so that CPU time is mainly used to assemble the network/protocol information. These assumptions are intended to bias the answer towards 'worst case' estimates, where a request requires some bandwidth, some CPU time and a disk access.
I'm only looking for something accurate to an order of magnitude or so.
Read http://www.kegel.com/c10k.html. You might also read StackOverflow questions tagged 'c10k'. C10K stands for 10,000 simultaneous clients.
Long story short -- principally, the limit is neither bandwidth nor CPU. It's concurrency.
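To make "concurrency" concrete: the c10k-style servers all converge on an event loop, where one thread multiplexes thousands of sockets instead of dedicating a thread (with its stack and its context switches) to each connection. A toy illustration in Java NIO, not a real HTTP server (no keep-alive, no partial-write handling):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** Minimal single-threaded event loop: one thread multiplexes many connections. */
public class TinyEventLoopServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        byte[] response = ("HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok")
                .getBytes(StandardCharsets.US_ASCII);

        while (true) {
            selector.select();                       // block until some socket is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {            // new connection: register it for reads
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {       // request arrived: read it, write a canned reply
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(4096);
                    if (client.read(buf) == -1) { client.close(); continue; }
                    client.write(ByteBuffer.wrap(response));
                    client.close();                  // keep the example simple: no keep-alive
                }
            }
        }
    }
}
```

The interesting property is that another ten thousand mostly idle connections cost almost nothing here, whereas a thread-per-connection design would be scheduling ten thousand threads.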
Six years ago, I saw an 8-proc Windows Server 2003 box serve 100,000 requests per second for static content. That box had 8 Gigabit Ethernet cards, each on a separate subnet. The limiting factor there was network bandwidth. There's no way you could serve that much content over the Internet, even with a truly enormous pipe.
In practice, for purely static content, even a modest box can saturate a network connection.
For dynamic content, there's no easy answer. It could be CPU utilization, disk I/O, backend database latency, not enough worker threads, too much context switching, ...
You have to measure your application to find out where your bottlenecks lie. It might be in the framework, it might be in your application logic. It probably changes as your workload changes.
I think it really depends on what you are serving.
If you're serving web applications that dynamically render html, CPU is what is consumed most.
If you are serving up a relatively small number of static items lots and lots of times, you'll probably run into bandwidth issues first (since the static files will quickly end up cached in memory).
If you're serving up a large number of static items, you may run into disk limits first (seeking and reading files)
If you are not able to cache your files in memory, then disk seek times will likely be the limiting factor and cap your performance at less than 1000 requests/second: a spinning disk manages on the order of 100-200 random seeks per second, so even several spindles together stay in the hundreds of requests per second. This might improve when using solid state disks.
100, 1,000, 10,000, 100,000 or 1,000,000 requests/second, but which is true today?
This test was done on a modest i3 laptop, but it reviewed Varnish, ATS (Apache Traffic Server), Nginx, Lighttpd, etc.
http://nbonvin.wordpress.com/2011/03/24/serving-small-static-files-which-server-to-use/
The interesting point is that using a high-end 8-core server gives only a very small boost to most of them (Apache, Cherokee, Litespeed, Lighttpd, Nginx, G-WAN):
http://www.rootusers.com/web-server-performance-benchmark/
As the tests were done on localhost to avoid hitting the network as a bottleneck, the problem is in the kernel, which does not scale unless you tune its options.
So, to answer your question, the margin for progress is in the way servers process I/O.
They will have to use better data structures (wait-free, for example).
I think there are too many variables here to answer your question.
What processor, what speed, what cache, what chipset, what disk interface, what spindle speed, what network card, how it's all configured; the list is huge. I think you need to approach the problem from the other side...
"This is what I want to do and achieve, what do I need to do it?"
OS, server language, and filesystem effects are the variables here. If you take them out, then you're left with a no-overhead TCP socket.
At that point it's not really a question of the performance of the server, but of the network. With a no-overhead TCP socket, the limit you hit will most likely be at the firewall or your network switches, in terms of how many connections can be handled concurrently.
In any web application that uses a database you also open up a whole new range of optimisation needs.
indexes, query optimisation, etc.
For static files, does your application cache them in memory?
etc, etc, etc
This will depend on:
what your CPU is,
what speed your disks are,
how fat a medium-sized hosting company's pipe is,
what the web server is.
The question is too general.
Deploy your server, test it using tools like http://jmeter.apache.org/, and see how you get on.
I know there's no single hard-and-fast answer, but is there a rough, generic estimate for the encryption overhead of SSL versus unencrypted socket communication? I'm talking only about the comm processing and wire time, not counting application-level processing.
Update
There is a question about HTTPS versus HTTP, but I'm interested in looking lower in the stack.
(I replaced the phrase "order of magnitude" to avoid confusion; I was using it as informal jargon rather than in the formal CompSci sense. Of course if I had meant it formally, as a true geek I would have been thinking binary rather than decimal! ;-)
Update
Per request in comment, assume we're talking about good-sized messages (range of 1k-10k) over persistent connections. So connection set-up and packet overhead are not significant issues.
Order of magnitude: zero.
In other words, you won't see your throughput cut in half, or anything like it, when you add TLS. Answers to the "duplicate" question focus heavily on application performance, and how that compares to SSL overhead. This question specifically excludes application processing, and seeks to compare non-SSL to SSL only. While it makes sense to take a global view of performance when optimizing, that is not what this question is asking.
The main overhead of SSL is the handshake. That's where the expensive asymmetric cryptography happens. After negotiation, relatively efficient symmetric ciphers are used. That's why it can be very helpful to enable SSL sessions for your HTTPS service, where many connections are made. For a long-lived connection, this "end-effect" isn't as significant, and sessions aren't as useful.
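One way to see whether your clients actually benefit from session reuse is to connect twice and compare session IDs: if the second connection resumes the first session, the expensive asymmetric step was skipped. A rough client-side probe, assuming a plain JSSE client and an arbitrary HTTPS host (note that with session tickets or TLS 1.3 the IDs may not match even when resumption happens, so treat it only as an indicator):

```java
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Connects twice to the same host and checks whether the TLS session was resumed. */
public class SessionReuseCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "example.com";  // any HTTPS host
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        byte[] firstId = handshake(factory, host);
        byte[] secondId = handshake(factory, host);

        // If the IDs match, the second connection skipped the full (expensive) handshake.
        System.out.println("session resumed: " + java.util.Arrays.equals(firstId, secondId));
    }

    static byte[] handshake(SSLSocketFactory factory, String host) throws Exception {
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, 443)) {
            long t0 = System.nanoTime();
            socket.startHandshake();
            System.out.printf("handshake took %.1f ms%n", (System.nanoTime() - t0) / 1e6);
            return socket.getSession().getId();
        }
    }
}
```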
Here's an interesting anecdote. When Google switched Gmail to use HTTPS, no additional resources were required; no network hardware, no new hosts. It only increased CPU load by about 1%.
I second @erickson: the pure data-transfer speed penalty is negligible. Modern CPUs reach a crypto/AES throughput of several hundred MBit/s. So unless you are on a resource-constrained system (a mobile phone), TLS/SSL is fast enough for slinging data around.
But keep in mind that encryption makes caching and load balancing much harder. This might result in a huge performance penalty.
But connection setup really is a showstopper for many applications. On low-bandwidth, high-packet-loss, high-latency connections (a mobile device in the countryside), the additional round trips required by TLS might turn something slow into something unusable.
For example, we had to drop the encryption requirement for access to some of our internal web apps: they were next to unusable when used from China.
Assuming you don't count connection set-up (as you indicated in your update), it strongly depends on the cipher chosen. Network overhead (in terms of bandwidth) will be negligible. CPU overhead will be dominated by cryptography. On my mobile Core i5, I can encrypt around 250 MB per second with RC4 on a single core. (RC4 is what you should choose for maximum performance.) AES is slower, providing "only" around 50 MB/s. So, if you choose correct ciphers, you won't manage to keep a single current core busy with the crypto overhead even if you have a fully utilized 1 Gbit line. [Edit: RC4 should not be used because it is no longer secure. However, AES hardware support is now present in many CPUs, which makes AES encryption really fast on such platforms.]
Connection establishment, however, is different. Depending on the implementation (e.g. support for TLS false start), it will add round-trips, which can cause noticeable delays. Additionally, expensive crypto takes place on the first connection establishment (the above-mentioned CPU could only accept 14 connections per core per second if you foolishly used 4096-bit keys, and about 100 if you use 2048-bit keys). On subsequent connections, previous sessions are often reused, avoiding the expensive crypto.
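Those connections-per-second figures follow directly from how many RSA private-key operations a core can do, since that is the dominant cost of a full handshake. If you want the number for your own hardware, a sketch like the following gives a rough upper bound (real handshakes add key exchange, certificate handling and more on top):

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

/** Rough count of RSA private-key operations per second (the expensive part of a full handshake). */
public class RsaHandshakeCost {
    public static void main(String[] args) throws Exception {
        for (int bits : new int[] {2048, 4096}) {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(bits);
            KeyPair pair = gen.generateKeyPair();

            Signature signer = Signature.getInstance("SHA256withRSA");
            byte[] data = new byte[64];

            int count = 0;
            long start = System.nanoTime();
            while (System.nanoTime() - start < 2_000_000_000L) {  // run for ~2 seconds
                signer.initSign(pair.getPrivate());
                signer.update(data);
                signer.sign();
                count++;
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("RSA-%d: ~%.0f private-key ops/s on one core%n", bits, count / seconds);
        }
    }
}
```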
So, to summarize:

Transfer on an established connection:
  Delay: nearly none
  CPU: negligible
  Bandwidth: negligible

First connection establishment:
  Delay: additional round-trips
  Bandwidth: several kilobytes (certificates)
  CPU on client: medium
  CPU on server: high

Subsequent connection establishments:
  Delay: additional round-trip (not sure if one or multiple; may be implementation-dependent)
  Bandwidth: negligible
  CPU: nearly none