SSL on MQ: CPU Performance

I would like to deploy SSL on an MQ server, but first I need to know whether my current CPU capacity will support it. (I do not have the budget to increase the number of CPU cores or the MQ PVU count.)
My specs:
Windows Server 2003 SP2,
1 core of an Intel Xeon E5-2690 @ 2.9 GHz,
2 GB RAM,
1 queue manager,
Linear logging,
Persistent messages,
DQM with 5 partners,
10 sender channels,
10 receiver channels
In a typical month:
we exchange on average 3 million messages with our partners, totalling 15 GB of data
(so about 5 KB per message);
CPU usage varied on average between 20% and 40%;
we saw 4 peaks of 100% CPU.
Do you think my system can cope with SSL using the RC4_MD5_EXPORT cipher?
Best Regards,
Pascal

I don't think it's possible to provide a definitive answer as to whether your server can cope with enabling SSL using the RC4_MD5_EXPORT cipher on your MQ channels, short of trying it and assessing the impact. You may also want to look at the processor queue length using the Windows Performance Monitor tool to see how many processes are waiting for CPU time when usage increases.
As your CPU provides hardware support for the AES encryption algorithm (AES-NI), you may want to consider using one of the AES CipherSpecs instead. This also has the advantage of providing better security, as both MD5 and RC4 are fairly weak in terms of hashing and encryption respectively.
One option to consider is installing an SSL acceleration card in your server to allow the messages to be encrypted/hashed using dedicated hardware rather than your server's CPU. This page on the IBM Knowledge Center provides some further information and lists which cards are supported by WebSphere MQ: http://www-01.ibm.com/support/knowledgecenter/#!/SSFKSJ_7.5.0/com.ibm.mq.ref.doc/q049300_.htm
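For illustration, switching a channel to an AES CipherSpec is a one-line MQSC change per channel end (the channel name here is a placeholder, and both ends of the channel must specify the same CipherSpec):

    ALTER CHANNEL('TO.PARTNER.A') CHLTYPE(SDR) SSLCIPH('TLS_RSA_WITH_AES_128_CBC_SHA')
    REFRESH SECURITY TYPE(SSL)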

Related

High UDP communication latency because of audio rendering (Windows, C++)

I am trying to communicate with an external robot at 1 kHz over UDP using the WinSock library. I am running Windows 10 Pro Version 21H2. On the hardware side, I use a PC with an Intel Core i9-10900X, 32 GB of RAM, and an Intel I219 NIC.
Up to a point it works pretty well: I measured the time spent on the communication (sending and receiving the packets sequentially takes between 200 and 500 microseconds), and I also verified with Wireshark the number of packets exchanged (1000 packets sent per second and 1000 packets received per second). The throughput is 2 Mbps sending and 3 Mbps receiving.
The issue starts when any audio is rendered (even the sound played when changing the volume on Windows); this leads to noticeable latency (about 10 to 15 milliseconds).
Stopping the Windows Audio service solves the issue, but our application needs to render sound permanently.
[Graph: round-trip time and frequency vs. index of UDP query, using the PCI NIC]
A temporary solution was to use a USB/Ethernet adapter instead of the NIC. With this type of device we see no added latency, but we have experienced performance drops with such adapters in the past due to thermal throttling.
[Graph: round-trip time and frequency vs. index of UDP query, using the USB/Ethernet adapter]
I also tried reducing the priority of the audio process: no difference. I likewise tried setting the affinity mask of the audio service to different cores than my application: no difference either.
My question: is there a way to increase the audio latency so as to prioritize the UDP communication, or to reduce the UDP latency, so that we can meet our 1 kHz requirement?
This problem is due to the Receive Side Throttle: the network throttling Windows applies to packet processing while multimedia (such as audio) is playing.
To fix it, set the registry value
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile\NetworkThrottlingIndex to 0xffffffff and reboot Windows.
Note that this registry key is private and internal to the Windows OS; it is not meant to be used publicly and is not officially supported by Microsoft.
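For illustration, a minimal Python sketch that makes this change (it must run from an elevated prompt, and the caveat above about the key being unsupported applies):

    import winreg

    # Open the MMCSS SystemProfile key with write access (requires admin rights).
    key = winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE,
        r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile",
        0,
        winreg.KEY_SET_VALUE,
    )
    # 0xffffffff disables the network throttling applied while audio plays.
    winreg.SetValueEx(key, "NetworkThrottlingIndex", 0, winreg.REG_DWORD, 0xFFFFFFFF)
    winreg.CloseKey(key)
    # Reboot Windows afterwards for the change to take effect.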

Reducing the CPU impact of producers on RabbitMQ

All,
I've been doing performance testing with RabbitMQ on a mid-grade desktop machine (5th-gen i3), and I've found that while my RabbitMQ instance can handle dozens of exchanges and even 100-200 queues & consumers fairly well, as soon as I increase the number of producers beyond ~30 the CPU usage very quickly goes to 100%. RAM usage is always acceptable (low hundreds of MB).
I'm sending messages with sizes ranging from 400 bytes to 40 kB, and interestingly message size seems to have only a modest effect on CPU (it mainly affects RAM).
I've been testing messaging rates from my producers of between 1/second and 100/second, and while the rate certainly has an effect on CPU, it is not nearly as strong as the effect of the number of producers. For example, 10 producers at 10 msg/second is MUCH less of a load than 100 producers generating 1 msg/second.
When CPU usage hits the wall I don't see any other red/yellow flags like max. # of Erlang processes, or file descriptor limits, in the RabbitMQ Admin console.
I'm currently using Python 3 + Pika 0.10 with a BlockingConnection and basic_publish, and no delivery mode specified for my producers (i.e. messages are not persistent). Is this highly asymmetric (and heavy) load from the producer side expected behavior?
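Each producer is essentially doing this (the exchange and routing key names are placeholders, and the exchange is assumed to be declared already):

    import pika

    # One BlockingConnection per producer, publishing transient messages.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.basic_publish(exchange="my_exchange",
                          routing_key="my.key",
                          body=b"x" * 400)   # payloads range from 400 B to 40 kB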
Are there techniques I can use to reduce the load my producers put on my RabbitMQ instance?
Thanks and Regards

How to optimize ZeroMQ Performance On Windows (XP SP3)

I have two Windows XP SP3 machines between which I am trying to send 3 kB ZMQ messages. These are both fairly modern systems (a dual quad-core Xeon with the 5100 chipset and a dual hex-core Xeon with the 5500 chipset) with server-grade Intel gigabit Ethernet cards.
The two machines are connected point to point without a switch or router in between.
Using pcttcp as a performance baseline, I am able to send 70 MB/s (56% utilization) via TCP from one machine to the other. With ZMQ PUSH/PULL I am only able to get ~28 MB/s between the two.
With the sender and receiver on the same machine (the slower of the two) I am able to achieve a rate of 97 MB/s (220 MB/s on the dual hex-core).
The PUSH/PULL channel has a HWM set on both ends. It performs marginally better if the HWM sizes are set low (~150 messages) rather than to a larger value like 1024.
I tried 6000-byte jumbo frames and it got worse (pcttcp performed marginally better though, at 72 MB/s).
I tried setting TcpWindowSize to a larger value but that seemed to make things worse as well; ZMQ liked the lower size, while pcttcp did not change. TcpWindowSize is now set to 32K.
Other parameters:
TcpAckFrequency = 1 // would not work without this.
Tcp1323Opts = 1
Receive Side Scaling enabled
How should I approach finding the bottleneck? What should I expect to achieve with TCP and ZMQ performance? The ZeroMQ web site's performance section details tests in which the throughput approaches that of raw TCP (95%+).
Any performance tips / wisdom (aside from use linux, ;-) ) would be greatly appreciated.
Thanks!!!
Another clue: if I set up multiple sender/receiver pairs between the two systems (same direction, different ports) I am able to achieve a higher aggregate rate (a total of ~42 MB/s with three).
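For reference, the test is structurally equivalent to this minimal pyzmq sketch (host, port, and iteration count are placeholders; on the ZeroMQ 2.x used here a single zmq.HWM option covers both directions, where modern pyzmq splits it into SNDHWM/RCVHWM):

    import zmq

    ctx = zmq.Context()

    # PUSH side with a small high-water mark (~150 messages performed best).
    push = ctx.socket(zmq.PUSH)
    push.setsockopt(zmq.SNDHWM, 150)
    push.connect("tcp://receiver-host:5555")

    msg = b"x" * 3000              # 3 kB messages, as in the test above
    for _ in range(100000):
        push.send(msg)

    # The PULL side on the other machine mirrors this: it sets zmq.RCVHWM to
    # 150, binds to tcp://*:5555, and times recv() calls to compute MB/s.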
A quick Google search turned up http://comments.gmane.org/gmane.network.zeromq.devel/10089
The nugget out of that thread is TcpDelAckTicks: "I got a huge increase of performance (2.4 seconds to 0.4 seconds) after setting the TcpDelAckTicks registry value on the machine that does the apr_socket_accept() call in the server code. The client just sends a request and waits for the response in a loop. There was no change in performance."
I ended up there because I was looking into MTU settings, thinking the problem might be network-related.
And then I found http://lists.zeromq.org/pipermail/zeromq-dev/2010-November/007814.html, which has a number of performance-tuning recommendations (though not specifically for XP). I won't summarise them here, as it would be an almost direct copy and paste (and I'm not sure I could be more succinct).
I'm not sure this will be helpful, but you might not have spotted those threads.

What SSL cipher suite has the least overhead?

What SSL cipher suite has the least overhead? A clearly compromised suite would be undesirable; however, there are degrees of problems. For instance, RC4 is still in the SSL 3.0 specification. What is a good recommendation for a high-traffic website? Would the cipher suite change if it weren't being used for HTTP?
It depends on whether you mean network or CPU overhead.
Network overhead is about packet size. The initial handshake involves some asymmetric cryptography; the DHE cipher suites (where the server certificate is used for digital signatures only) imply a ServerKeyExchange message which needs a few hundred extra bytes compared with an RSA key exchange. This is a one-time cost, and clients will reuse sessions (continuing a previous TLS session with a symmetric-only, shortened key exchange).
Also, data is exchanged by "records". A record can embed up to 16 kB worth of data. A record has a size overhead which ranges from 21 bytes (with RC4 and MD5) to 57 bytes (with a 16-byte block cipher such as AES, and SHA-1, and TLS 1.1 or later). So that's at worst 0.34% size overhead.
CPU overhead of SSL is now quite small. Use openssl speed to get some raw figures; on my PC (a 2.4 GHz Core2 from two years ago), RC4 appears to be about twice as fast as AES, but AES already runs at 160 MB/s, i.e. 16 times faster than 100baseT Ethernet can transmit. The integrity check (with MD5 or SHA-1) is considerably faster than the encryption. So the cipher suite with the least CPU overhead should be SSL_RSA_WITH_RC4_128_MD5, but it would take a rather special kind of setup to actually notice the difference from, e.g., TLS_RSA_WITH_AES_128_CBC_SHA. Also, some of the newer Intel processors have AES-specific instructions, which make AES faster than RC4 on those systems (the VIA C7 x86 clones also have hardware acceleration for some cryptographic algorithms). RC4 may give you an extra edge in some corner cases due to its very small code size -- in case your application is rather heavy on code and you run into L1 cache issues.
(As usual, for performance issues, actual measures always beat theory.)
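If you would rather measure from code than via openssl speed, here is a rough Python micro-benchmark of symmetric throughput (it assumes the third-party cryptography package is installed; the loop count and buffer size are arbitrary):

    import os, time
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    data = os.urandom(16 * 1024)   # 16 kB: one full TLS record's worth of data
    enc = Cipher(algorithms.AES(os.urandom(16)), modes.CBC(os.urandom(16))).encryptor()

    start = time.perf_counter()
    for _ in range(10000):
        enc.update(data)           # encrypt 16 kB per iteration
    elapsed = time.perf_counter() - start
    print(f"AES-128-CBC: {10000 * len(data) / elapsed / 1e6:.0f} MB/s")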
The cipher suite with the least overhead is RSA_WITH_RC4_MD5. Note that the way RC4 is used in TLS does not render it broken, as it is for example in WEP, but its security can still be questioned. It also uses HMAC-MD5, which is likewise not the best choice, even though no attacks are known yet. Several web sites (unfortunately) use only that cipher suite for efficiency. If you have an Intel server with AES-NI instructions you might want to experiment with RSA_WITH_AES_128_SHA1; it is faster than RSA_WITH_RC4_MD5 on the systems I've tested.
I was reading up on SSL/TLS and stumbled on this thread. I know it's old; I just want to add a few updates in case someone else lands here.
Some ciphers offer more security and some more performance. But since this was posted, several changes to SSL/TLS have been introduced, most notably on the security side.
For good, always-up-to-date cipher recommendations, check out the SSL/TLS configuration generator by Mozilla.
It is also worth noting that if you are concerned with performance, there are other aspects of the SSL connection you can explore, such as:
OCSP stapling
Session resumption (tickets)
Session resumption (caching)
False Start (NPN needed)
HTTP/2

How much overhead does SSL impose?

I know there's no single hard-and-fast answer, but is there a generic order-of-magnitude approximation for the encryption overhead of SSL versus unencrypted socket communication? I'm talking only about the comms processing and wire time, not counting application-level processing.
Update
There is a question about HTTPS versus HTTP, but I'm interested in looking lower in the stack.
(I replaced the phrase "order of magnitude" to avoid confusion; I was using it as informal jargon rather than in the formal CompSci sense. Of course if I had meant it formally, as a true geek I would have been thinking binary rather than decimal! ;-)
Update
Per request in comment, assume we're talking about good-sized messages (range of 1k-10k) over persistent connections. So connection set-up and packet overhead are not significant issues.
Order of magnitude: zero.
In other words, you won't see your throughput cut in half, or anything like it, when you add TLS. Answers to the "duplicate" question focus heavily on application performance, and how that compares to SSL overhead. This question specifically excludes application processing, and seeks to compare non-SSL to SSL only. While it makes sense to take a global view of performance when optimizing, that is not what this question is asking.
The main overhead of SSL is the handshake. That's where the expensive asymmetric cryptography happens. After negotiation, relatively efficient symmetric ciphers are used. That's why it can be very helpful to enable SSL sessions for your HTTPS service, where many connections are made. For a long-lived connection, this "end-effect" isn't as significant, and sessions aren't as useful.
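As a concrete illustration, here is a minimal sketch of session reuse with Python's ssl module (Python 3.6+; host and port are placeholders):

    import socket, ssl

    ctx = ssl.create_default_context()

    # The first connection pays for the full handshake...
    with ctx.wrap_socket(socket.create_connection(("example.com", 443)),
                         server_hostname="example.com") as s1:
        # With TLS 1.3, the session ticket may only arrive after the first read.
        session = s1.session          # capture the negotiated TLS session

    # ...later connections resume it, skipping the expensive asymmetric step.
    with ctx.wrap_socket(socket.create_connection(("example.com", 443)),
                         server_hostname="example.com", session=session) as s2:
        print("resumed:", s2.session_reused)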
Here's an interesting anecdote. When Google switched Gmail to use HTTPS, no additional resources were required; no network hardware, no new hosts. It only increased CPU load by about 1%.
I second erickson: the pure data-transfer speed penalty is negligible. Modern CPUs reach a crypto/AES throughput of several hundred MBit/s. So unless you are on a resource-constrained system (e.g. a mobile phone), TLS/SSL is fast enough for slinging data around.
But keep in mind that encryption makes caching and load balancing much harder. This might result in a huge performance penalty.
But connection setup is a real showstopper for many applications. On low-bandwidth, high-packet-loss, high-latency connections (e.g. a mobile device in the countryside) the additional round trips required by TLS can render something slow into something unusable.
For example, we had to drop the encryption requirement for access to some of our internal web apps: they were next to unusable when used from China.
Assuming you don't count connection set-up (as you indicated in your update), it strongly depends on the cipher chosen. Network overhead (in terms of bandwidth) will be negligible. CPU overhead will be dominated by cryptography. On my mobile Core i5, I can encrypt around 250 MB per second with RC4 on a single core. (RC4 is what you should choose for maximum performance.) AES is slower, providing "only" around 50 MB/s. So, if you choose correct ciphers, you won't manage to keep a single current core busy with the crypto overhead even if you have a fully utilized 1 Gbit line. [Edit: RC4 should not be used because it is no longer secure. However, AES hardware support is now present in many CPUs, which makes AES encryption really fast on such platforms.]
Connection establishment, however, is different. Depending on the implementation (e.g. support for TLS False Start), it adds round trips, which can cause noticeable delays. Additionally, expensive crypto takes place on the first connection establishment (the above-mentioned CPU could only accept 14 connections per core per second if you foolishly used 4096-bit keys, and about 100 if you use 2048-bit keys). On subsequent connections, previous sessions are often reused, avoiding the expensive crypto.
So, to summarize:
Transfer on established connection:
Delay: nearly none
CPU: negligible
Bandwidth: negligible
First connection establishment:
Delay: additional round-trips
Bandwidth: several kilobytes (certificates)
CPU on client: medium
CPU on server: high
Subsequent connection establishments:
Delay: additional round-trip (not sure if one or multiple; may be implementation-dependent)
Bandwidth: negligible
CPU: nearly none
