OpenSSL: SSL_write abnormal latency - performance

I use OpenSSL on top of a TCP socket. Non-blocking mode (O_NONBLOCK) is set with fcntl, and Nagle's algorithm is disabled with TCP_NODELAY.
I see a huge (~45 ms) latency between the moment SSL_write returns and the moment the data is actually captured by tcpdump. My message is ~1100 bytes long.
My understanding is that if the underlying BIO is a socket, no buffering should be involved. Am I wrong? What could cause such poor performance?

Related

How do tools like iperf measure UDP?

Given that UDP packets don't actually send acks, how does a program like iperf measure their one-way performance, i.e., how can it confirm that the packets actually reached:
within a time frame
intact, and uncorrupted
By contrast, it seems intuitive to me that TCP, whose packets are acknowledged by the receiver, can be benchmarked rigorously and reliably from the client alone.
1/ "how can it confirm that the packets actually reached [...] intact, and uncorrupted"
UDP is an unfairly despised protocol, but come on, this is going way too far here! :-)
UDP has a checksum, just like TCP:
https://en.wikipedia.org/wiki/User_Datagram_Protocol#Checksum_computation
2/ "how can it confirm that the packets actually reached [...] within a time frame"
It does not, because this is not what UDP is about, nor TCP by the way.[*]
As can be seen from its source code here:
https://github.com/esnet/iperf/blob/master/src/iperf_udp.c#L55
...what it does do, though, is check for out-of-order packets. A "pcount" is set on the sending side and checked on the receiving side here:
https://github.com/esnet/iperf/blob/master/src/iperf_udp.c#L99
...and it calculates a somewhat bogus jitter:
https://github.com/esnet/iperf/blob/master/src/iperf_udp.c#L110
(real life is more complicated than this, you not only have jitter, but also drift)
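As a rough illustration of that kind of jitter figure, here is a minimal Python sketch of the smoothed estimator described in RFC 3550 (J += (|D| - J) / 16). I am assuming the linked iperf code follows this style of calculation; the function names below are mine, not iperf's.

```python
def update_jitter(jitter, prev_transit, transit):
    """One step of the RFC 3550 style estimator: J += (|D| - J) / 16."""
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0


def jitter_over(samples):
    """samples: (sent_at, received_at) pairs in arrival order, in seconds."""
    jitter = 0.0
    prev_transit = None
    for sent_at, received_at in samples:
        # Transit time includes any constant clock offset between the hosts.
        transit = received_at - sent_at
        if prev_transit is not None:
            jitter = update_jitter(jitter, prev_transit, transit)
        prev_transit = transit
    return jitter
```

Because only differences between consecutive transit times are used, a constant clock offset between sender and receiver cancels out. Clock drift does not cancel, which is one reason the number should be read with some suspicion.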
[*]:
For semi-guaranteed, soft "within a time frame" / real-time protocols at layer 3 and above, look at RTP, RTSP and such. But neither TCP nor UDP inherently has this.
For a real, serious hard real-time guarantee, you've got to go to layer 2 protocols such as Ethernet-AVB:
https://en.wikipedia.org/wiki/Audio_Video_Bridging
...which were designed because IP and above simply cannot. make. hard. real. time. guaranteed. delivery. Period.
EDIT:
This is another debate, but...
The first thing you need for "within a time frame" is a shared wall clock on the sending and receiving systems (otherwise, how could you tell that a received packet is out of date?).
From layer 3 (IP) and above, the NTP precision target is about 1 ms. It can be better than that on a LAN (but across IP networks, you are just taking a chance and hoping for the best).
On layer 2, aka "LAN", the layer 2 PTP (Precision Time Protocol, IEEE 1588) targets the sub-microsecond range. That's 1000 times more accurate. The same goes for the derived IEEE 802.1AS, "Timing and Synchronization for Time-Sensitive Applications (gPTP)", used in Ethernet AVB.
Conclusion on this sub-topic:
TCP/IP, though very handy and powerful, is not designed to "guarantee delivery within a time frame". Be it TCP or UDP. Get this idea out of your head.
The obvious way would be to connect to a server that participates in the testing.
The client starts by (for example) connecting to an NTP server to get an accurate time base.
Then the UDP client sends a series of packets to the server. In its payload, each packet contains:
a serial number
a timestamp when it was sent
a CRC
The server then looks these over and notes whether any serial numbers are missing (after some reasonable timeout) and compares the time it received each packet to the time the client sent the packet. After some period of time, the server sends a reply indicating how many packets it received, the mean and standard deviation of the transmission times, and an indication of how many appeared to be corrupted based on the CRCs they contained.
Depending on taste, you might also want to set up a simultaneous TCP connection from the client to the server to coordinate testing of the UDP channel and (possibly) return results.
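The scheme above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the wire layout (8-byte serial, 8-byte send time, CRC32 trailer) and the names build_probe, parse_probe and summarize are hypothetical, and the sketch leaves out the actual sockets and the NTP step.

```python
import statistics
import struct
import zlib

# Hypothetical layout: 8-byte serial, 8-byte send time (seconds), then payload,
# followed by a 4-byte CRC32 over everything before it.
HEADER = struct.Struct("!Qd")
TRAILER = struct.Struct("!I")


def build_probe(serial, sent_at, payload=b""):
    """Pack serial number, send timestamp and payload, then append a CRC32."""
    body = HEADER.pack(serial, sent_at) + payload
    return body + TRAILER.pack(zlib.crc32(body))


def parse_probe(datagram):
    """Return (serial, sent_at), or None if the datagram is short or corrupted."""
    if len(datagram) < HEADER.size + TRAILER.size:
        return None
    body, trailer = datagram[:-TRAILER.size], datagram[-TRAILER.size:]
    (crc,) = TRAILER.unpack(trailer)
    if zlib.crc32(body) != crc:
        return None
    return HEADER.unpack_from(body)


def summarize(received, expected_count):
    """received: iterable of (datagram, received_at) pairs seen by the server."""
    delays = {}
    corrupted = 0
    for datagram, received_at in received:
        parsed = parse_probe(datagram)
        if parsed is None:
            corrupted += 1
            continue
        serial, sent_at = parsed
        delays.setdefault(serial, received_at - sent_at)  # ignore duplicates
    values = list(delays.values())
    return {
        "received": len(values),
        # Serials for which no intact copy arrived (lost or corrupted).
        "missing": expected_count - len(values),
        "corrupted": corrupted,
        "mean_delay": statistics.mean(values) if values else None,
        "stdev_delay": statistics.stdev(values) if len(values) > 1 else None,
    }
```

The server would call summarize once its timeout expires and send the resulting dictionary back, for example over the side TCP connection. The delays are only meaningful to the extent that the two clocks agree.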

Understanding SPDY latency claims

Reading the SPDY whitepaper at http://dev.chromium.org/spdy/spdy-whitepaper, it seems like supporting it will improve my HTTP latency. However, I'm not clear on a few of the claims.
1) "Because HTTP can only fetch one resource at a time (HTTP pipelining helps, but still enforces only a FIFO queue), a server delay of 500 ms prevents reuse of the TCP channel for additional requests." -- Where did this 500ms number come from?
2) "We discovered that SPDY's latency savings increased proportionally with increases in packet loss rates, up to a 48% speedup at 2%." -- But doesn't putting all the requests on a single TCP connection mean that congestion control will slow down all your requests, whereas if you had multiple connections, one TCP stream would slow down but the others would not?
3) "[With pipelining] any delays in the processing of anything in the stream (either a long request at the head-of-line or packet loss) will delay the entire stream." -- This implies that packet loss would not delay the entire stream using SPDY. Why not?
The 500ms reference is simply an example; the number could be 50ms or 5s, but the point is the same: HTTP forces FIFO processing, which results in inefficient use of the underlying TCP connection. As the paper notes, pipelining can help in theory, but in practice pipelining is not used because many intermediaries break when you turn it on. Hence, you're stuck with the worst-case scenario: full RTT + server processing time, and FIFO ordering.
Re: packet loss. Yup, you're exactly right. One of the downsides of using a single connection is that in the case of packet loss, the throughput of the entire connection is cut in half, as opposed to 1/2 of one of the N connections in flight. Having said that, there are also some benefits! For example, when you saturate a single connection, you get much faster recovery due to triple ACKs, plus potentially much wider congestion windows to begin with. Because most HTTP transfers are relatively small (tens of KB), it is not unusual for many connections to terminate even before they exit the slow-start phase!
Re: pipelining. A lost packet would delay the stream - that's TCP. The win is in eliminating head-of-line blocking, which enables a lot more and a lot smarter optimization by the browser, followed by some of the wins I described above.
@GroovyDotCom: Here's some hands-on proof of HTTP2's (SPDY's) performance benefits:
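To put a number on the "cut in half" point, here is a back-of-the-envelope Python sketch. It assumes a deliberately simple model that is mine, not the whitepaper's: N connections share the bandwidth equally, and a single loss halves the rate of exactly one of them. The function name is illustrative.

```python
def fraction_after_one_loss(connections):
    """Aggregate throughput left right after one loss, as a fraction of before.

    Assumes equal shares and that the loss halves only the affected connection.
    """
    if connections < 1:
        raise ValueError("need at least one connection")
    return 1 - 1 / (2 * connections)


single = fraction_after_one_loss(1)   # one SPDY-style connection
several = fraction_after_one_loss(6)  # six parallel HTTP connections
```

Under this model the single connection drops to half its rate, while six parallel connections together keep 11/12 of theirs. The model ignores how quickly each setup recovers, which is where the single saturated connection gets some of that back.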
http://www.httpvshttps.com/

Speed/Performance Characteristics of blocking vs Non-Blocking winsock

Is there in general a speed or performance difference in blocking and non-blocking Winsock TCP Sockets?
I understand how the two kinds of socket differ, but I can't find a detailed performance comparison between them.
Because it isn't about speed. The operations write and read are just memory copying in disguise. All they do is copy data to and from the kernel, respectively. I.e. they don't actually send or receive anything.
The blocking vs. non-blocking choice asks: do you prefer these operations to block until they complete, or to return -1 with EAGAIN when they can't be performed immediately? (On Winsock the equivalent is SOCKET_ERROR with WSAEWOULDBLOCK reported by WSAGetLastError.) For example, you read from a socket but there's nothing in the receive buffer. Do you prefer recv to hang until something comes along, or to return the error right away?
In my experience non-blocking Winsock operations are slightly slower but much more scalable. The fact is that you need to make two system calls plus some dispatching at the application level when you perform non-blocking I/O (with IOCP), and one system call if you use blocking I/O. If you have many concurrent connections, non-blocking I/O is much faster because the architecture is more scalable, provided it is implemented well.
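To make that choice concrete, here is a minimal Python sketch rather than Winsock C. In Python the "would block" error surfaces as BlockingIOError on every platform. The helper names try_recv and demo are illustrative only.

```python
import select
import socket


def try_recv(sock, size=4096):
    """Non-blocking read: return bytes, or None if nothing is buffered yet."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        return None  # the EAGAIN / WSAEWOULDBLOCK case


def demo():
    left, right = socket.socketpair()
    try:
        right.setblocking(False)
        before = try_recv(right)          # nothing sent yet, returns at once
        left.sendall(b"ping")
        # Wait (up to 1 s) until the kernel reports data in the receive buffer.
        select.select([right], [], [], 1.0)
        after = try_recv(right)
        return before, after
    finally:
        left.close()
        right.close()
```

With a blocking socket the first try_recv would simply hang until the peer sent something. The data copied is the same either way, which is why the choice is about program structure rather than raw speed.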
If you need to transfer data from point to point with max bandwidth - use blocking I/O. If you need to handle many concurrent client connections - use nonblocking I/O. Don't expect too much from any of them.
In general this is more about "event-driven vs threaded" server architecture than "blocking vs nonblocking". There is no universal server architecture that can be used in every situation. It depends on the application.

How can TCP be tuned for high-performance one-way transmission?

My (network) client sends 50 to 100 KB data packets every 200 ms to my server. There are up to 300 clients. The server sends nothing to the clients. The (dedicated) server and the clients are on a LAN. How can I tune the TCP configuration for better performance? The server runs Windows Server 2003 or 2008; the clients run Windows 2000 and up.
For example, the TCP window size: does changing this parameter help? Anything else? Any special socket options?
[EDIT]: actually in different modes packets can be up to 5MB
I did a study on this a couple of years ago with 1700 data points. The conclusion was that the single best thing you can do is configure an enormous socket receive buffer (e.g. 512k) at the receiver. Do that on the listening socket, so that it is inherited by the accepted sockets and is therefore already set while they are handshaking. That in turn allows TCP window scaling to be negotiated during the handshake, which lets the client know about a window size > 64k. The enormous window size basically lets the client transmit at the maximum possible rate, subject only to congestion avoidance rather than closed receive windows.
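A minimal sketch of that ordering in Python, for illustration only: the function name make_listener and its defaults are my own, and the operating system is free to round or cap the size you request.

```python
import socket


def make_listener(host="127.0.0.1", port=0, rcvbuf=512 * 1024):
    """Create a listening TCP socket with a large receive buffer requested."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request the buffer BEFORE listen(), so accepted sockets start out with it
    # and window scaling can be negotiated during their handshakes.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    listener.bind((host, port))
    listener.listen()
    return listener


if __name__ == "__main__":
    srv = make_listener()
    # The effective value may differ from the request (OS limits, rounding).
    print(srv.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    srv.close()
```

It is worth reading the value back with getsockopt, as in the usage lines, because a system-wide limit can silently reduce it.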
What OS?
IPv4 or v6?
Why such a large dump; why can't it be broken down?
Assuming a solid, stable, low bandwidth-delay product, you can adjust things like in-flight sizing, initial window size, and MTU (depending on the data, IP version, and mode [TCP/UDP]).
You could also round-robin or balance inputs, so you have less interrupt time from the NIC... binding is an option as well.
5MB /packet/? That's a pretty poor design. I would think it'd lead to a lot of segment retransmissions, and a LOT of kernel/stack memory being used in sequence reconstruction and retransmits (accept wait time, etc.).
(Is that even possible?)
Since all clients are in LAN, you might try enabling "jumbo frames" (need to run a netsh command for that, would need to google for the precise command, but there are plenty of how-tos).
On the application layer, you could use TransmitFile, which is the Windows sendfile equivalent and which works very well under Windows Server 2003 (it is artificially rate-limited under "non server", but that won't be a problem for you). Note that you can use a memory mapped file if you generate the data on the fly.
As for tuning parameters, increasing the send buffer will likely not give you any benefit, though increasing the receive buffer may help in some cases because it reduces the likelihood of packets being dropped if the receiving application does not handle the incoming data fast enough. A bigger TCP window size (registry setting) may help, as this allows the sender to send out more data before having to block until ACKs arrive.
Yanking up the program's working set quota may be worth a consideration, it costs you nothing and may be an advantage, since the kernel needs to lock pages when sending them. Being allowed to have more pages locked might make things faster (or might not, but it won't hurt either, the defaults are ridiculously low anyway).

How much overhead does SSL impose?

I know there's no single hard-and-fast answer, but is there a generic order-of-magnitude estimate approximation for the encryption overhead of SSL versus unencrypted socket communication? I'm talking only about the comm processing and wire time, not counting application-level processing.
Update
There is a question about HTTPS versus HTTP, but I'm interested in looking lower in the stack.
(I replaced the phrase "order of magnitude" to avoid confusion; I was using it as informal jargon rather than in the formal CompSci sense. Of course if I had meant it formally, as a true geek I would have been thinking binary rather than decimal! ;-)
Update
Per request in comment, assume we're talking about good-sized messages (range of 1k-10k) over persistent connections. So connection set-up and packet overhead are not significant issues.
Order of magnitude: zero.
In other words, you won't see your throughput cut in half, or anything like it, when you add TLS. Answers to the "duplicate" question focus heavily on application performance, and how that compares to SSL overhead. This question specifically excludes application processing, and seeks to compare non-SSL to SSL only. While it makes sense to take a global view of performance when optimizing, that is not what this question is asking.
The main overhead of SSL is the handshake. That's where the expensive asymmetric cryptography happens. After negotiation, relatively efficient symmetric ciphers are used. That's why it can be very helpful to enable SSL sessions for your HTTPS service, where many connections are made. For a long-lived connection, this "end-effect" isn't as significant, and sessions aren't as useful.
Here's an interesting anecdote. When Google switched Gmail to use HTTPS, no additional resources were required; no network hardware, no new hosts. It only increased CPU load by about 1%.
I second @erickson: The pure data-transfer speed penalty is negligible. Modern CPUs reach a crypto/AES throughput of several hundred MBit/s. So unless you are on a resource-constrained system (mobile phone), TLS/SSL is fast enough for slinging data around.
But keep in mind that encryption makes caching and load balancing much harder. This might result in a huge performance penalty.
But connection setup is really a show stopper for many applications. On low-bandwidth, high-packet-loss, high-latency connections (a mobile device in the countryside), the additional round trips required by TLS might turn something slow into something unusable.
For example, we had to drop the encryption requirement for access to some of our internal web apps - they were next to unusable when used from China.
Assuming you don't count connection set-up (as you indicated in your update), it strongly depends on the cipher chosen. Network overhead (in terms of bandwidth) will be negligible. CPU overhead will be dominated by cryptography. On my mobile Core i5, I can encrypt around 250 MB per second with RC4 on a single core. (RC4 is what you should choose for maximum performance.) AES is slower, providing "only" around 50 MB/s. So, if you choose correct ciphers, you won't manage to keep a single current core busy with the crypto overhead even if you have a fully utilized 1 Gbit line. [Edit: RC4 should not be used because it is no longer secure. However, AES hardware support is now present in many CPUs, which makes AES encryption really fast on such platforms.]
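The arithmetic behind that claim is easy to check. Here is a small Python sketch using the figures quoted above (250 MB/s for RC4, 50 MB/s for software AES), assuming decimal units, i.e. 1 Gbit/s = 125 MB/s. The function name is just for illustration.

```python
def cores_needed(link_gbit_per_s, cipher_mb_per_s):
    """Cores required to encrypt a fully used link at the given per-core rate."""
    link_mb_per_s = link_gbit_per_s * 1000 / 8  # Gbit/s -> MB/s (decimal units)
    return link_mb_per_s / cipher_mb_per_s


rc4_cores = cores_needed(1, 250)  # 0.5 of a core at the quoted RC4 rate
aes_cores = cores_needed(1, 50)   # 2.5 cores at the quoted software-AES rate
```

So at the quoted RC4 speed half a core saturates a 1 Gbit line, whereas at the quoted software-AES speed it would take about two and a half cores. That gap is what hardware AES support closes.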
Connection establishment, however, is different. Depending on the implementation (e.g. support for TLS false start), it will add round-trips, which can cause noticeable delays. Additionally, expensive crypto takes place on the first connection establishment (the above-mentioned CPU could only accept 14 connections per core per second if you foolishly used 4096-bit keys, and 100 if you used 2048-bit keys). On subsequent connections, previous sessions are often reused, avoiding the expensive crypto.
So, to summarize:
Transfer on established connection:
Delay: nearly none
CPU: negligible
Bandwidth: negligible
First connection establishment:
Delay: additional round-trips
Bandwidth: several kilobytes (certificates)
CPU on client: medium
CPU on server: high
Subsequent connection establishments:
Delay: additional round-trip (not sure if one or multiple, may be implementation-dependent)
Bandwidth: negligible
CPU: nearly none
