How to optimize ZeroMQ performance on Windows (XP SP3)

I have two Windows XP SP3 machines between which I am trying to send 3k ZMQ messages. These are both fairly modern systems (a dual quad-core Xeon with the 5100 chipset and a dual hex-core Xeon with the 5500 chipset) with server-grade Intel gigabit Ethernet cards.
The two machines are connected point to point without a switch or router in between.
Using pcttcp as a performance comparison, I am able to send 70 MB/s (56% utilization) via TCP from one machine to the other. With ZMQ PUSH/PULL I am only able to get ~28 MB/s between the two.
With the sender and receiver on the same machine (the slower of the two) I am able to achieve a rate of 97 MB/s (220 MB/s on the dual hex-core machine).
The PUSH/PULL channel has a HWM set on both ends. It performs marginally better if the HWM is set low (~150 messages) rather than to a larger value like 1024.
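For reference, the sender side looks roughly like this (a minimal sketch using the current libzmq C API with ZMQ_SNDHWM; older 2.x builds used a single ZMQ_HWM option and a different zmq_send signature, so adjust for your version):

    #include <zmq.h>
    #include <string.h>

    int main(void)
    {
        void *ctx  = zmq_ctx_new();
        void *push = zmq_socket(ctx, ZMQ_PUSH);

        /* Cap the user-space send queue; ~150 messages worked
           marginally better than 1024 in my tests. */
        int hwm = 150;
        zmq_setsockopt(push, ZMQ_SNDHWM, &hwm, sizeof hwm);
        zmq_bind(push, "tcp://*:5555");

        char msg[3072];               /* 3 KB payload assumed from "3k messages" */
        memset(msg, 'x', sizeof msg);
        for (;;)
            zmq_send(push, msg, sizeof msg, 0);
    }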
I tried 6000-byte jumbo frames and it got worse (pcttcp performed marginally better though, at ~72 MB/s).
I tried setting TcpWindowSize to a larger value but that seemed to get worse as well; ZMQ liked the lower size, pcttcp did not change. TcpWindowSize is now set to 32K.
Other parameters:
TcpAckFrequency = 1 // would not work without this.
Tcp1323Opts = 1
Receive Side Scaling enabled
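For reference, TcpAckFrequency and TcpDelAckTicks live under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{adapter-GUID}. A hedged C sketch of setting one programmatically (the GUID is a placeholder you have to look up under that key; a reboot or interface restart is needed for it to take effect):

    #include <windows.h>

    /* Set TcpAckFrequency=1 for one NIC. The interface GUID below is a
       placeholder -- look yours up under this key in regedit. */
    int set_ack_frequency(void)
    {
        HKEY key;
        DWORD one = 1;
        const char *path =
            "SYSTEM\\CurrentControlSet\\Services\\Tcpip\\Parameters\\"
            "Interfaces\\{YOUR-ADAPTER-GUID}";   /* placeholder */

        if (RegOpenKeyExA(HKEY_LOCAL_MACHINE, path, 0, KEY_SET_VALUE, &key))
            return -1;
        RegSetValueExA(key, "TcpAckFrequency", 0, REG_DWORD,
                       (const BYTE *)&one, sizeof one);
        RegCloseKey(key);
        return 0;
    }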
How should I approach finding the bottleneck? What should I expect to achieve with TCP and ZMQ performance? The ZeroMQ web site's performance section details tests in which throughput approaches that of raw TCP (95%+).
Any performance tips / wisdom (aside from "use Linux" ;-) ) would be greatly appreciated.
Thanks!!!
Another clue: if I set up multiple sender/receiver pairs between the two systems (same direction, different ports) I am able to achieve a higher aggregate rate (a total of ~42 MB/s with three).

A quick Google search pulled this up: http://comments.gmane.org/gmane.network.zeromq.devel/10089
The nugget out of that thread is TcpDelAckTicks:
"I got a huge increase of performance (2.4 seconds to 0.4 seconds) after setting the TcpDelAckTicks registry value on the machine that does the apr_socket_accept() call in the server code. Client just sends request and waits for response in loop. There was no change in performance."
I ended up there because I was looking for something around MTU, thinking the problem might be network related.
And then I found this: http://lists.zeromq.org/pipermail/zeromq-dev/2010-November/007814.html, which has a number of performance-tuning recommendations (though not specifically for XP). I won't summarise them here, as it would be an almost direct copy and paste (and I'm not sure I can be more succinct).
I'm not sure this'll be helpful, but you might not have spotted them.

Related

Is TCP/IP the optimal protocol for transferring large amounts of data?

TCP/IP is universal and suits most cases, but as a general solution it is not optimal for specific ones:
1) Transferring data across continents over links with packet loss (for example Aspera, which in some cases makes transfers 10x faster).
2) Gigabit LANs with no packet loss. Here TCP/IP introduces overhead with ACKs and mechanisms designed for long-distance, slow networks. I remember reading about some protocol for gigabit LANs that is significantly faster than TCP/IP.
The latter case is interesting for a backup solution that has to transfer huge amounts of data. What do you know about alternative network data-transfer protocols for Windows?
If you are doing backup, I'm guessing #2 is the case you're concerned about.
TCP has several optimizations to address #2: sliding windows, window scaling, and fast retransmit and recovery should congestion occur. As long as the receiver's window is open, ACKs don't gate the effective bandwidth.
Since this question is on SO, I'm assuming programming is involved, so in implementing your receiving program you can keep the window open by providing large buffers. Use the bandwidth-delay product (BDP) to determine buffer size. You can compute it dynamically, or, if your environment is stable, use a static calculation.
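For example (illustrative numbers): a gigabit link with a 2 ms RTT gives a BDP of 125 MB/s x 0.002 s = 250 KB, so a receive buffer of around 512 KB leaves headroom. A hedged Winsock sketch:

    #include <winsock2.h>

    /* Size the receive buffer from the bandwidth-delay product.
       bits_per_sec and rtt_ms are values you measure or configure. */
    static void size_rcvbuf(SOCKET s, double bits_per_sec, double rtt_ms)
    {
        int bdp = (int)(bits_per_sec / 8.0 * rtt_ms / 1000.0);
        int buf = 2 * bdp;            /* 2x BDP for headroom */
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, (const char *)&buf, sizeof buf);
    }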
Regarding Windows protocols, you have two choices: "in the box" and 3rd party. You can view the in-box protocols by going to Control Panel, Network, Change adapter settings (for your GigE adapter), Properties, Install, Protocol. On my 2008R2 system I only see Microsoft Virtual Switch Protocol and Reliable Multicast Protocol. Neither would help unless you want to back up to multiple locations simultaneously (using multicast).
As far as 3rd party protocols go, that's really beyond the scope of SO. A couple of well chosen web searches will fill the bill for that.
And if you're going for the absolute fastest speed and your backup source and target are in the same broadcast domain, you can skip IP altogether and program at the MAC layer. You'll lose a lot of functionality, but if you do it well it'll be fast.
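For a flavor of what that looks like on Windows: a hedged sketch using the WinPcap/Npcap API (assumes the library is installed and a handle was opened with pcap_open_live; the MACs and the 0x88B5 local-experimental EtherType are placeholders):

    #include <pcap.h>
    #include <string.h>

    /* Send one raw Ethernet frame over an already-opened capture handle. */
    int send_raw(pcap_t *handle)
    {
        unsigned char frame[1514] = {
            0x02,0x00,0x00,0x00,0x00,0x01,  /* destination MAC (placeholder) */
            0x02,0x00,0x00,0x00,0x00,0x02,  /* source MAC (placeholder)      */
            0x88,0xB5                       /* EtherType: local experimental */
        };
        memset(frame + 14, 'x', sizeof frame - 14);  /* payload */
        return pcap_sendpacket(handle, frame, sizeof frame);  /* 0 = ok */
    }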

How to utilize all available bandwidth with real-time data?

How do I measure the actual bandwidth between server and client to decide how much real-time data to send?
My server sends real-time data to clients, 30 times per second. If the server has too much data it prioritises data chunks and throws away anything that doesn't fit into the available bandwidth, because that data will be invalidated next tick anyway. Data is sent over a reliable channel (20%) and an unreliable channel (80%), both UDP-based (if TCP as the reliable channel can provide any benefit, please let me know). The data is highly latency-sensitive. The server often (but not always!) has more data than available bandwidth. It's critical to send as much data as possible, but no more than the available bandwidth, to avoid packet drops or higher latency.
Server and client are custom applications so can implement any algorithm/protocol.
My main problem is how to keep track of available bandwidth. Also any statistical info about typical bandwidth jitter would be helpful (servers are in a cloud, clients are home users, worldwide).
At the moment I'm thinking about how to use:
latency info from the reliable channel. It should correlate with bandwidth, because if latency grows this can (!) mean retransmission is involved as a result of packet drops, and so the server must lower its data rate.
the amount of data received by the client on the unreliable channel during a time frame, especially if it is lower than the amount the server sent.
if the current latency is close to or below the lowest recorded one, bandwidth can be increased.
The problem is that this approach is too complicated and involves a lot of "heuristics", like what the step to increase/decrease bandwidth should be, etc.
Looking for any advice from people who have dealt with a similar problem in the past, or just any bright ideas.
The first symptom of trying to use more bandwidth than you actually have will be increased latency, as you fill up the buffers between the sender and whatever the bottleneck is. See https://en.wikipedia.org/wiki/Bufferbloat. My guess is that if you can successfully detect increased latency as you start to fill up the bandwidth and back off then you can avoid packet loss.
I wouldn't underestimate TCP - people have spent a lot of time tuning its congestion avoidance to get a reasonable amount of the available bandwidth while still being a good network citizen. It may not be easy to do better.
On the other hand, a lot will depend on the attitude of the intermediate nodes, which may treat UDP differently from TCP. You may find that under load they either prioritize or discard UDP. Also some networks, especially with satellite links, may use https://en.wikipedia.org/wiki/TCP_acceleration without you even knowing about it. (This was a painful surprise for us - we relied on the TCP connection failing and keep-alive to detect loss of connectivity. Unfortunately the TCP accelerator in use maintained a connection to us, pretending to be the far end, even when connectivity to the far end had in fact been lost).
After some research, the problem has a name: congestion control, or congestion avoidance. It's quite a complicated topic and there's a lot of material about it. TCP's congestion control has evolved over time and is a really good one. There are other protocols that implement congestion control too, e.g. UDT or SCTP.
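To make the "heuristics" concrete, the usual starting point is plain AIMD (additive increase, multiplicative decrease), which is what TCP's congestion control does at heart. A toy sketch built on the latency signal described above; all constants are assumptions to tune:

    /* Toy AIMD rate controller: call once per tick (e.g., every 33 ms)
       with the latest RTT sample. All constants are assumptions to tune. */
    typedef struct {
        double rate_bps;      /* current send budget    */
        double base_rtt_ms;   /* lowest RTT seen so far */
    } aimd_t;

    void aimd_update(aimd_t *c, double rtt_ms)
    {
        if (rtt_ms < c->base_rtt_ms)
            c->base_rtt_ms = rtt_ms;

        if (rtt_ms > 1.5 * c->base_rtt_ms)  /* queues building: back off   */
            c->rate_bps *= 0.8;             /* multiplicative decrease     */
        else
            c->rate_bps += 16000.0;         /* additive increase, 16 kbit/s */
    }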

How can TCP be tuned for high-performance one-way transmission?

My (network) client sends 50 to 100 KB data packets to my server every 200 ms. There are up to 300 clients. The server sends nothing to the clients. The server (dedicated) and clients are on a LAN. How can I tune the TCP configuration for better performance? The server is on Windows Server 2003 or 2008; clients are on Windows 2000 and up.
E.g., TCP window size: does changing this parameter help? Anything else? Any special socket options?
[EDIT]: actually, in different modes packets can be up to 5 MB
I did a study on this a couple of years ago with 1700 data points. The conclusion was that the single best thing you can do is configure an enormous socket receive buffer (e.g. 512k) at the receiver. Do that to the listening socket, so it will be inherited by the accepted sockets and will already be set while they are handshaking. That in turn allows TCP window scaling to be negotiated during the handshake, which allows the client to know about a window size > 64k. The enormous window size basically lets the client transmit at the maximum possible rate, subject only to congestion avoidance rather than closed receive windows.
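In Winsock terms the ordering looks like this (a minimal sketch, error handling and WSAStartup omitted); the point is that setsockopt runs before listen(), so window scaling can be negotiated in the handshake:

    #include <winsock2.h>

    SOCKET make_listener(unsigned short port)
    {
        SOCKET ls = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

        /* Set the big receive buffer BEFORE listen()/accept() so window
           scaling is negotiated during the handshake and inherited by
           every accepted socket. */
        int buf = 512 * 1024;
        setsockopt(ls, SOL_SOCKET, SO_RCVBUF, (const char *)&buf, sizeof buf);

        struct sockaddr_in addr = {0};
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port        = htons(port);
        bind(ls, (struct sockaddr *)&addr, sizeof addr);
        listen(ls, SOMAXCONN);
        return ls;
    }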
What OS?
IPv4 or v6?
Why so large of a dump; why can't it be broken down?
Assuming a solid, stable, low bandwidth-delay product, you can adjust things like in-flight sizing, initial window size, and MTU (depending on the data, IP version, and mode [TCP/UDP]).
You could also round-robin or balance inputs so you have less interrupt time from the NIC; binding is an option as well.
5 MB per packet? That's a pretty poor design. I would think it'd lead to a lot of segment retransmits, and a LOT of kernel/stack memory being used in sequence reconstruction / retransmits (accept wait time, etc.).
(Is that even possible?)
Since all clients are on the LAN, you might try enabling "jumbo frames" (you need to run a netsh command for that; I'd have to google for the precise command, but there are plenty of how-tos).
On the application layer, you could use TransmitFile, which is the Windows sendfile equivalent and which works very well under Windows Server 2003 (it is artificially rate-limited under "non server", but that won't be a problem for you). Note that you can use a memory mapped file if you generate the data on the fly.
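A hedged sketch of the call (mswsock.h, link against Mswsock.lib; the connected socket is assumed to exist):

    #include <winsock2.h>
    #include <mswsock.h>
    #include <windows.h>

    /* Stream an entire file to a connected socket in kernel mode. */
    BOOL send_file(SOCKET s, const char *path)
    {
        HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (f == INVALID_HANDLE_VALUE)
            return FALSE;

        /* 0 bytes-to-write means "the whole file"; 0 per-send uses the
           default chunk size. */
        BOOL ok = TransmitFile(s, f, 0, 0, NULL, NULL, TF_USE_KERNEL_APC);
        CloseHandle(f);
        return ok;
    }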
As for tuning parameters, increasing the send buffer will likely not give you any benefit, though increasing the receive buffer may help in some cases because it reduces the likelihood of packets being dropped if the receiving application does not handle the incoming data fast enough. A bigger TCP window size (registry setting) may help, as this allows the sender to send out more data before having to block until ACKs arrive.
Yanking up the program's working set quota may be worth considering; it costs you nothing and may be an advantage, since the kernel needs to lock pages when sending them. Being allowed to have more pages locked might make things faster (or might not, but it won't hurt either; the defaults are ridiculously low anyway).

Cannot achieve full speed on Symmetrical Internet Connection

We are using a business Ethernet connection (3 Mbit upload, 3 Mbit download) and trying to understand issues with our tested bandwidth speeds. When uploading a large file we sustain 340 KB/s; downloading, we also sustain 340 KB/s. However, when we run these transfers simultaneously, the two transfer speeds rise and fall erratically, with an average speed for both of around 250 KB/s. We're using a Hatteras HN404 CPi and we've bypassed the router (plugged a machine directly into the Hatteras; set the NIC to full duplex).
Is this expected? Should a max upload interfere with a max download on this type of Internet connection?
Are you sure the bottleneck is your connection?
Do you also see this behavior when the simultaneous upload and download are occurring on different systems, or only when one system is handling both the upload and download?
If the problem goes away when independent machines are doing the work, the bottleneck is likely closer to the hard drive.
This sounds expected, from my experience with lower-end lines. On a home line, I've found that traffic shaping and changing buffer sizes can be a huge help.
TCP/IP without any unusual traffic shaping will favor the most aggressive traffic at the expense of everything else. In your case, this means the outgoing ACKs and such for the download will be delayed or maybe even dropped. See if your HN404 supports class-based queuing or something similar and try it out.
Yes it is expected. This is symptomatic of any case in which you have a throttled or capped connection. If you saturate your uplink it will affect your downlink and vice versa.
This is because your connection's rate-limiting impacts the TCP acknowledgement (ACK) packets and disrupts the normal "balance" of how these packets flow.
This is very thoroughly described on this page about Cable Modem Troubleshooting Tips, although it is not limited to cable modems:
"If you saturate your cable modem's upload cap with an upload, the ACK packets of your download will have to queue up waiting for a gap between the congested upload data packets. So your ACKs will be delayed getting back to the remote download server, and it will therefore believe you are on a very slow link, and slow down the transmission of further data to you."
So how do you avoid this? The best way is to implement some sort of traffic-shaping or QoS (Quality of Service) on individual sessions to limit them to a maximum throughput based on a percentage of your total available bandwidth.
For example, on my home network no outbound connection can utilize more than 67% (2/3) of my 192 Kbps uplink. That means any single outbound session can only use 128 Kbps, protecting my downlink speed by preventing the uplink from becoming saturated.
In most cases you are able to perform this kind of traffic shaping based on any available criteria, such as source IP, destination IP, protocol, port, time of day, etc.
It appears that I was wrong about the simultaneous transfer speeds. The 250 KB/s speeds up and down were miscalculated by the transfer program (it seems to have been showing a high average speed). Apparently the business Ethernet (in this case an XO circuit provisioned by Speakeasy) only supports 3 Mbit total, not up AND down (for 6 Mbit total). So if I am transferring up and down at the same time, in theory I should only get 1.5 Mbit each way, or 187.5 KB/s at the maximum (if there were zero overhead).

How can I estimate Ethernet performance?

I need to think about the performance limitations of 100 Mbps Ethernet (including scenarios with up to ~100 endpoints on the same subnet), and I'm wondering how best to go about estimating the capacity of the network. Are there any rules of thumb for this?
The reason I ask is that I am working on some back-of-the-envelope level calculations about performance limitations, so it doesn't need to be incredibly accurate. I just haven't been through this exercise before and was hoping to gain some insight from those who have. Mark Brackett's answer (as of 1/26) is along the lines of what I am looking for.
If you're using switches (and, honestly, who isn't these days) - then I've found 80% of capacity a reasonable estimate. Usually, it's really about 90% because of TCP overhead - but 80% accounts for occasional retransmits.
If it's a single collision domain (hubs), then you'd probably be around 30% with moderate activity on those 100 nodes. But, it'd be pretty variable based on the traffic generated. And anyone putting 100 nodes in a single CD these days would no doubt be shot - so I don't think you'll actually run into those IRL.
Edit: Note that these numbers are for a relatively healthy network - one that is generally defined as working. Extremely excessive broadcasts or other anomalous traffic patterns have been known to bring a network to its knees.
Use WANem:
"WANem is a Wide Area Network Emulator, meant to provide a real experience of a Wide Area Network/Internet during application development/testing over a LAN environment."
You can simulate any network scenario using it and then test your application's behaviour. It is open source and is available on SourceForge.
Link: WANem - The Wide Area Network emulator
Opnet creates software for simulating network performance. I once used Opnet IT Guru Academic Edition; maybe this application or some other software from Opnet may be of some help.
100 endpoints are not supposed to be an issue. If the network is properly configured (nothing special), the only issue is bandwidth. Fast Ethernet (100 Mbps) should be able to transfer almost 10 MB (megabytes) per second, whether to one client or to many. If you are using hubs instead of switches, or half-duplex instead of full-duplex, then you should change that (this is the rule of thumb).
Working from the title of your post, "How can I estimate Ethernet performance?", see this wiki link: http://en.wikipedia.org/wiki/Ethernet_frame#Maximum_throughput
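The arithmetic behind that link, as a quick sanity check (a full-size TCP segment occupies 1538 bytes on the wire: 1460 payload + 40 TCP/IP headers + 14 Ethernet header + 4 FCS + 8 preamble + 12 interframe gap):

    /* Back-of-the-envelope ceiling for TCP goodput over 100 Mbps Ethernet. */
    #include <stdio.h>

    int main(void)
    {
        double wire_Bps = 100e6 / 8;                     /* 12.5 MB/s raw    */
        int on_wire     = 1460 + 40 + 14 + 4 + 8 + 12;   /* 1538 bytes/frame */
        double frames   = wire_Bps / on_wire;            /* ~8127 frames/s   */
        printf("max goodput: %.2f MB/s\n", frames * 1460 / 1e6);  /* ~11.87 */
        return 0;
    }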
