Low latency/high performance network (ethernet) messaging - windows

Background
I want to create a test application to test the network performance of different systems. To do this I plan to have that machine send out Ethernet frames over a private (otherwise non-busy) network to another machine (or device) that simply receives the message and sends it back. The sending application will record total roundtrip time (among other things).
The purpose of the tests is to see how a particular system (OS + components etc.) performs when it comes to networking traffic. This is illustrated as machine A in the picture below. Note that I'm not interested in the performance of the networking infrastructure (switches, cables etc) - I'm trying to test the performance of network traffic inside Machine A (i.e from when it hits the network card until it reaches user space)
We will (try to) measure all kind of things, one thing is the total roundtrip of the message but also things like interrupt latency of Machine A, general driver overhead etc. Machine A will be a real-time system. But to support these tests, I need a separate machine that can bounce back messages and in other ways add network stimuli to the tested system. This separate machine is Machine B in the picture below and is what this question is about.
My problem
I want to develop an application that can receive and return these messages with as consistent (and preferably low) latency as possible. I'm hoping to get latencies that are consistent within a few microseconds at least. For simplicity, I'd like to do this on a general purpose OS like Windows or Linux but I'm open for other suggestions. There will be no other load (CPU or otherwise) on the machine besides the operating system and my test application.
I've thought of the following approaches:
A normal application running in user space with a high priority
A thread running in kernel space to avoid the userspace/kernelspace transitions
An of-the-shelf device that already does this (I haven't found one though)
Questions
Are there any other approaches or perhaps frameworks that already does this? What else do I need to think of to gain a consistent and low latency? What approach is recommended?

You mentioned that you want to test the internal performance of Machine A, but "need a separate machine"; yet, you don't want to test network infrastructure performance.
You know much more about your requirements than I do; however, if I was testing network infrastructure in Machine A, I would set up my test like this:
There are couple of reasons for this:
You can use an Ethernet loopback cable to simulate the "pong" function performed by Machine B
Eliminating transit through infrastructure you don't care about is almost always a better solution when measuring performance
If you use this test method, be sure to note these points:
Ethernet performs a signal to noise test on the copper before it sets up a link. If you make your loopback bends too tight, you could introduce more latency if ethernet decides to fall back to a lower speed due to the kinks in the cable. There is no minimum length for copper ethernet cabling.
As you're probably aware, combinations of NICs / driver versions / OS can have a significant affect on intra-host latency. I work for a network equipment manufacturer, and one of the guys in the office used to work as an applications engineer for SolarFlare. He claims that many of the Wall Street trading systems use SolarFlare's NICs due to the low latency SolarFlare engineers their products for; he also said SolarFlare's drivers give you user-space access to the NIC buffers. Caveat: third-hand info, and I cannot verify myself.
If you loop the frames to Machine A, set the source and destination mac-address to the burned-in-address on the NIC
Even if you need to receive a modified "pong" frame from Machine B, you could still use this topology and simply rewrite packet fields on the receive-side of your code in Machine A. Put as many (or few) instrumentation points as you like in Machine A's "modules" to compare frame timestamps.
FYI:
The embedded systems I mentioned in my comments on your question are for measuring latency of network infrastructure, not end hosts. This is the best method I can think of for instrumenting host latency.

As an off the shelf solution, I would suggest taking a look at Solace, Tibco and AMQP. These are all enterprise messaging frameworks used extensively in trading applications. AMQP is open source and capable of handling throughputs of up to 100,000 messages per second. I am not sure of the latencies of other frameworks. There is a Java or C++ implementation of the AMQP message router. The C++ one of course returns higher performance.
Edit I've just heard of a new product called UltraMessaging which can provide 7,000,000 messages per second throughput with Java, C++ or C# clients. Crikey.
Best regards,

Related

Simulate slow speed for TCP sockets in Windows

I'm building an application that uses TCP sockets to communicate. I want to test how it behaves under slow-speed conditions.
There are similar question on the site, but as I understand it, they deal with HTTP traffic, or are about Linux. My traffic is not HTTP, just ordinary TCP sockets, and the OS is Windows.
I tried using fiddler's setting for Modem Speed but it didn't work, it seems to work only for HTTP connections.
While it is true that you probably want to invest in an extensive set of unit tests, You can simulate various network conditions using VMWare Workstation:
You will have to install a virtual machine for testing, setup bridged networking (for the vm to access your real network) and upload your code to the vm.
After that you can start changing the settings and see how your application performs.
NetLimiter can also be used, but it has fewer options (in your case, packet loss is very interesting to test and is not available in netlimiter).
There is an excellent utility for Windows that can do throttling and much more:
https://jagt.github.io/clumsy/
I think you're taking the wrong approach here.
You can achieve everything that you need with some well designed unit tests. All of the things that a slow network link causes can be simulated in a unit test environment in controlled conditions.
Things that your code MUST handle to deal with "slow" links are just things that you should be dealing with anyway, including:
The correct handling of fragmented messages. All of your network reading code needs to correctly assume that each read will return between 1 byte and the size of your read buffer. You should never assume that you'll get complete 'messages' as TCP knows nothing of your concept of messages.
TCP flow control causing either your synchronous sends to fail with some form of 'try later' error or your async sends to succeed and potentially use an uncontrolled amount of resources (see here for more details). Note that this can happen even on 'fast' links if you are sending faster than the receiver is consuming.
Timeouts - again this isn't limited to "slow" links. All of your timeout handling code should be robust and tested. You may want to make sure that any read timeout is based on any read completing rather than reading a complete message in x time. You may be getting your data at a slow rate but whilst you're still getting data the link is alive.
Connection failure - again not something specific to "slow" links. You need to know how you deal with connections being reset at any time.
In summary nothing you can achieve by running your client and server on a simulated slow network cannot be achieved with a decent set of unit tests and everything that you would want to test on such a link is something that could affect any of your connections on any speed of link.

How slow are TCP sockets compared to named pipes on Windows for localhost IPC?

I am developing a TCP Proxy to be put in front of a TCP service that should handle between 500 and 1000 active connections from the wild Internet.
The proxy is running on the same machine as the service, and is mostly-transparent. The service is for the most part unaware of the proxy, the only exception being the notification of the real remote IP address of the clients.
This means that, for every inbound open TCP socket, there are two more sockets on the server: the secondth of the pair in the Proxy, and the one on the real service behind the proxy.
The send and recv window sizes on the two Proxy sockets are set to 1024 bytes.
What are the performance implications on this? How slow is this configuration? Should I put some effort on changing the service to use Named Pipes (or other IPC mechanism), or a localhost TCP socket is for the most part an efficient IPC?
The merge of the two apps is not an option. Right now we are stuck with the two process configuration.
EDIT: The reason for having two separate process on the same hardware is 100% economics. We have one server only, and we are not planning on getting more (no money).
The TCP service is a legacy software in Visual Basic 6 which grew beyond our expectations. The proxy is C++. We don't have the time, money nor manpower to rewrite and migrate the VB6 code to a modern programming environment.
The proxy is our attempt to mitigate a specific performance issue on the service, a DDoS attack we are getting from time to time.
The proxy is open source, and here is the project source code.
It will be the same (or at least not measurably different). Winsock is smart enough to know if it's talking to a socket on the same host and, in that case, it will short-circuit pretty much everything below IP and copy data directly buffer-to-buffer. In terms of named pipes vs. sockets, if you need to potentially be able to communicate to different machines ever in the future, choose sockets. If you know for a fact that you'll never need to do that, pick whichever one your developers are most familiar or most comfortable with.
For anyone that comes to read this later, I want to add some findings that answer the original question.
For a utility we are developing we have a networking class that can use named pipes, or TCP with the same calls.
Here is a typical loop back file transfer on our test system:
TCP/IP Transfer time: 2.5 Seconds
Named Pipes Transfer time: 3.1 Seconds
Now, if you go outside the machine and connect to a remote computer on your network the performance for named pipes is much worse:
TCP/IP Transfer time: 12 Seconds
Named Pipes Transfer time: 2.5 Minutes (Yes Minutes!)
I realize that this is just one system (Windows 7) But I think it is a good indicator of how slow named pipes can be...and it seems like TCP is the way to go.
I know this topic is very old, but it was still relevant for me, and maybe others will look at this in the future as well.
I implemented IPC between Excel (VBA) and another process on the same machine, both via a TCP connection as well as via Named Pipes.
In a quick performance test, I submitted a message than consisted of 26 bytes from client (Excel) to server (not Excel), and waited for the reply message from the other process (which consisted of 12 bytes in the example).
I executed this a ton of times in a loop and measured the average execution time.
With TCP on localhost (Windows 7, no fastpath), one "conversation" (request+reply) took around 300-350 microseconds. Especially sending data was quite slow (sending the 26 bytes took around 200microseconds via TCP).
With Named Pipes, one conversation took around 60 microseconds on average - so a LOT faster.
I'm not entirely sure why the difference was so large. The corporate environment I tested this in has a strict firewall, package inspections and what not, so I THINK this may have been caused as even the localhost-based TCP connection went through security measures significantly slowing it down, while named pipe ones likely did not.
TL:DR: In my case, Named Pipes were around 5-6 times faster than TCP for small packages (have not tested with bigger ones yet)
http://msdn.microsoft.com/en-us/library/aa178138(v=sql.80).aspx
Let me sum it up for you. If you are worried about performance then use TCP/IP. But if you have a really fast network and your not worried about performance then Named Pipes would be "neat" in that it might save you some code.
Not to mention, if you stick to TCP then you will have something that can be scaled, and even load balanced when the time comes.
Cheers,
In the scenario you describe, the local TCP connections are very unlikely to be a bottleneck. It will introduce some overhead, of course, but this should be negligible unless your CPU is already running hot.
At a guess, if your server's CPU usage is normally below 50% or so (with the proxy in place) it isn't worth worrying about minimizing the overhead associated with the local TCP connections.
If CPU usage is regularly above 80% you should probably be doing some profiling. I'd start by comparing the CPU load (or, better still, the performance, if you can measure it meaningfully) when the proxy is in place to when it isn't. Unless the proxy is doing some complicated processing, the overhead associated with the extra TCP connections is probably a significant fraction of the total overhead introduced by the proxy, so that should give you at least an order-of-magnitude estimate of how much you'd gain by using a more efficient form of IPC.
What is the reason to have a proxy on the SAME machine, just curious?
Anyway:
There are several methods for IPC, TCP/IP, named Pipes are comparable in speed and complexity. If you really want something that scales well and has almost no overhead: use shared memory. Best used in combination with a lock free algorithm for advancing the pointers (or use one buffer for each reader (the proxy/the service) and writer(the service/the proxy)).

Simulating Slow Internet Connection [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
The community reviewed whether to reopen this question 3 days ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I know this is kind of an odd question. Since I usually develop applications based on the "assumption" that all users have a slow internet connection. But, does anybody think that there is a way to programmatically simulate a slow internet connection, so I can "see" how an application performs under various "connection speeds"?
I'm not worried about which language is used. And I'm not looking for code samples or anything, just interested in the logic behind it.
Starting with Chrome 38 you can do this without any plugins. Just click inspect element (or F12 hotkey), then click on "toggle device mod" and you will see something like this:
Among many other features it allows you to simulate specific internet connection (3G, GPRS)
P.S. for people who try to limit the upload speed. Sadly at the current time it is not possible.
P.S.2 now you do not need to toggle anything. Throttling panel is available right from the network panel.
Note that while clicking on the No throttling you can create your custom throttling options.
If you're running windows, fiddler is a great tool. It has a setting to simulate modem speed, and for someone who wants more control has a plugin to add latency to each request.
I prefer using a tool like this to putting latency code in my application as it is a much more realistic simulation, as well as not making me design or code the actual bits. The best code is code I don't have to write.
ADDED: This article at Pavel Donchev's blog on Software Technologies shows how to create custom simulated speeds: Limiting your Internet connection speed with Fiddler.
Google recommends:
Network Link Conditioner on OSX
Clumsy on Windows
Dummynet on Linux
On Linux machines u can use wondershaper
apt-get install wondershaper
$ sudo wondershaper {interface} {down} {up}
the {down} and {up} are bandwidth in kpbs
So for example if you want to limit the bandwidth of interface eth1 to 256kbps uplink and 128kbps downlink,
$ sudo wondershaper eth1 256 128
To clear the limit,
$ sudo wondershaper clear eth1
I was using http://www.netlimiter.com/ and it works very well. Not only limit speed for single processes but also shows actual transfer rates.
There are TCP proxies out there, like iprelay and Sloppy, that do bandwidth shaping to simulate slow connections. You can also do bandwidth shaping and simulate packet loss using IP filtering tools like ipfw and iptables.
You can try Dummynet, it can simulates queue and bandwidth limitations, delays, packet losses, and multipath effects
Use a web debugging proxy with throttling features, like Charles or Fiddler.
You'll find them useful web development in general. The major difference is that Charles is shareware, whereas Fiddler is free.
Also, for simulating a slow connection on some *nixes, you can try using ipfw. More information is provided by Ben Newman's answer on this Quora question
You can use NetEm (Network Emulation) as a proxy server to emulate many network characteristics (speed, delay, packet loss, etc.). It controls the networking using iproute2 package and it's enabled in the kernel of most Linux distributions.
It is controlled by the tc command-line application (from the iproute2 package), but there are also some web interface GUIs for NetEm, for example PHPnetemGUI2.
The advantage is that, as I wrote, it can emulate not only different network speeds but also, for example, the packet loss, duplication and/or corruption, random or defined delay, etc., so apart from the slow connections, you can also emulate various poorly performing networks and transmission errors.
For your application it's absolutely transparent, you can configure the operating system to use the NetEm as a proxy server, so all connections from that machine will be routed through it. Or you can configure only a specific application to use that proxy.
I have been using it to test the performance of an Android app on various emulated poor-performance networks.
Use a tool like TCPMon. It can fake a slow connection.
Basically, you request it the exact same thing and it just forwards the exact same request to the real server, and then delays the response with only the set amount of bytes.
For Linux, the following list of papers might be useful:
A Comparative Study of Network Link Emulators (2009)
KauNet: A Versatile and Flexible Emulation System (2009)
Dummynet Revisited (2010)
Measuring Accuracy and Performance of Network Emulators (2015)
Personally, whilst Dummynet is good, I find NetEm to be the most versatile for my use-cases; I'm usually interested in the effect of delays, rather than bandwidth (i.e. WiFi connection issues), and it's super-easy to emulate random packet loss/corruption, etc. It's also very accessible, and free (unlike the hardware-based Linktropy).
On a side-note, for Windows, Clumsy is awesome. I would also like to add that (regarding websites) browser throttling is not an accurate method for emulating real-life network issues (I think "TKK" commented on a few of the reasons why above).
Hope this helps someone!
One common case of shaping a single TCP connection can actually be assembled from dual pairs of socat and cpipe in UNIX fashion like this:
socat TCP-LISTEN:5555,reuseaddr,reuseport,fork SYSTEM:'cpipe -ngr -b 1 -s 10 | socat - "TCP:localhost:5000" | cpipe -ngr -b 1 -s 300'
This simulates a connection with bandwidth of approximately 300kB/s from your service at :5000 and to at approximately 10kB/s and listens on :5555 for incoming connections. Caveat: Note that this per-connection, so each individual TCP connection gets this amount.
Explanation:
The outer (left) socat listens with the given options on :5555 as a forking server. The first cpipe command in the SYSTEM:... option then throttles data that went into socket :5555 (and comes out of the first, outer socat) to at most 10kByte/s. That data is then forwarding using another socat which connects to localhost:5000 (where the service you want to slow down should be listening). Data from localhost:5000 is then put into the right cpipe command, which (with the given values) throttles it to about 300kB/s.
The option -ngr to cpipe is important. It causes cpipe to read non-greedily from its input file-descriptor. Otherwise, you might get stuck with data in the buffers not being forwarded and waiting for a reply.
Using the more common buffer tool instead of cpipe is likely possible as well.
(Credits: This is based on the "double-tee" recipe by Christophe Loor from the socat documentation)
Mac OSX since 10.10 has an app called Murus Firewall, which acts as a GUI to pf, the replacement for ipfw.
It works very well for system-wide or domain-specific throttling. I was just able to use it to slide my download speed between 300Kbps and 30Mbps to test how a streaming video player adjusts.
Updating this (9 years after it was asked) as the answer I was looking for wasn't mentioned:
Firefox also has presets for throttling connection speeds. Find them in the Network Monitor tab of the developer tools. Default is 'No throttling'.
Slowest is GPRS (Download speed: 50 Kbps, Upload speed: 20 Kbps, Minimum latency (ms): 500), ranging through 'good' and 'regular' 2G, 3G and 4G to DSL and WiFi (Download speed: 30Mbps, Upload speed: 15Mbps, Minimum latency (ms): 2).
More in the Dev Tools docs.
There is also another tool called WIPFW - http://wipfw.sourceforge.net/
It's a bit old school, but you can use it to simulate a slower connection. It's Windows based, and the tool allows the administrator to monitor how much traffic the router is getting from a certain machine, or how much WWW traffic it is forwarding, for example.
There is a simple and practical way to do it, without any application or code. Just connect to the internet using a mobile hotspot. Keep moving the hotspot (phone) away from the connected device to simulate slower networks. 😉

Testing file transfer speed across LAN/WAN

Is there a utility for Windows that allows you to test different aspects of file transfer operations across a Lan or a Wan.
Example...
How long does it take to move a file of a known size (500 MB or 1 GB) from Server A (on site) to Server B (on site) or to Server C (off site-Satellite location)?
D-ITG will allow you to test many aspects of your links. It does not necessarily allow you transfer a file directly, but it allows you to control almost all aspects of the transmission of data across the wire.
If all you are interested in is bulk transfer time (and not all the nitty-gritty details) you could just use a basic FTP application and time the transfer.
Probably nothing you've not already figured out. You could get some coarse grain metrics using a batch file to coordinate:
start monitoring
copy file
stop monitoring
Copy file might just be initiating a file copy between two nodes on the LAN, or it might initiate a FTP copy between two nodes on the WAN.
Monitoring could be as basic as writing the current time to output or file, or it could be as complex as adding performance counter metrics from the network adapter on the two machines.
A commercial WAN emulator would also give you the information your looking for. I've used the Shunra Appliance successfully in the past. Its pretty expensive, so I'd really only recommend it if critical business success is riding on understanding how application behavior could change based on network conditions and is something you could incorporate into regular testing activities.

Common Issues in Developing Cluster Aware non-web-based Enterprise Applications

I've to move a Windows based multi-threaded application (which uses global variables as well as an RDBMS for storage) to an NLB (i.e., network load balancer) cluster. The common architectural issues that immediately come to mind are
Global variables (which are both read/ written) will have to be moved to a shared storage. What are the best practices here? Is there anything available in Windows Clustering API to manage such things?
My application uses sockets, and persistent connections is a norm in the field I work. I believe persistent connections cannot be load balanced. Again, what are the architectural recommendations in this regard?
I'll answer the persistent connection part of the question first since it's easier. All good network load-balancing solutions (including Microsoft's NLB service built into Windows Server, but also including load balancing devices like F5 BigIP) have the ability to "stick" individual connections from clients to particular cluster nodes for the duration of the connection. In Microsoft's NLB this is called "Single Affinity", while other load balancers call it "Sticky Sessions". Sometimes there are caveats (for example, Microsoft's NLB will break connections if a new member is added to the cluster, although a single connection is never moved from one host to another).
re: global variables, they are the bane of load-balanced systems. Most designers of load-balanced apps will do a lot of re-architecture to minimize dependence on shared state since it impedes the scalabilty and availability of a load-balanced application. Most of these approaches come down to a two-step strategy: first, move shared state to a highly-available location, and second, change the app to minimize the number of times that shared state must be accessed.
Most clustered apps I've seen will store shared state (even shared, volatile state like global variables) in an RDBMS. This is mostly out of convenience. You can also use an in-memory database for maximum performance. But the simplicity of using an RDBMS for all shared state (transient and durable), plus the use of existing database tools for high-availability, tends to work out for many services. Perf of an RDBMS is of course orders of magnitude slower than global variables in memory, but if shared state is small you'll be reading out of the RDBMS's cache anyways, and if you're making a network hop to read/write the data the difference is relatively less. You can also make a big difference by optimizing your database schema for fast reading/writing, for example by removing unneeded indexes and using NOLOCK for all read queries where exact, up-to-the-millisecond accuracy is not required.
I'm not saying an RDBMS will always be the best solution for shared state, only that improving shared-state access times are usually not the way that load-balanced apps get their performance-- instead, they get performance by removing the need to synchronously access (and, especially, write to) shared state on every request. That's the second thing I noted above: changing your app to reduce dependence on shared state.
For example, for simple "counters" and similar metrics, apps will often queue up their updates and have a single thread in charge of updating shared state asynchronously from the queue.
For more complex cases, apps may swtich from Pessimistic Concurrency (checking that a resource is available beforehand) to Optimistic Concurrency (assuming it's available, and then backing out the work later if you ended up, for example, selling the same item to two different clients!).
Net-net, in load-balanced situations, brute force solutions often don't work as well as thinking creatively about your dependency on shared state and coming up with inventive ways to prevent having to wait for synchronous reading or writing shared state on every request.
I would not bother with using MSCS (Microsoft Cluster Service) in your scenario. MSCS is a failover solution, meaning it's good at keeping a one-server app highly available even if one of the cluster nodes goes down, but you won't get the scalability and simplicity you'll get from a true load-balanced service. I suspect MSCS does have ways to share state (on a shared disk) but they require setting up an MSCS cluster which involves setting up failover, using a shared disk, and other complexity which isn't appropriate for most load-balanced apps. You're better off using a database or a specialized in-memory solution to store your shared state.
Regarding persistent connection look into the port rules, because port rules determine which tcpip port is handled and how.
MSDN:
When a port rule uses multiple-host
load balancing, one of three client
affinity modes is selected. When no
client affinity mode is selected,
Network Load Balancing load-balances
client traffic from one IP address and
different source ports on
multiple-cluster hosts. This maximizes
the granularity of load balancing and
minimizes response time to clients. To
assist in managing client sessions,
the default single-client affinity
mode load-balances all network traffic
from a given client's IP address on a
single-cluster host. The class C
affinity mode further constrains this
to load-balance all client traffic
from a single class C address space.
In an asp.net app what allows session state to be persistent is when the clients affinity parameter setting is enabled; the NLB directs all TCP connections from one client IP address to the same cluster host. This allows session state to be maintained in host memory;
The client affinity parameter makes sure that a connection would always route on the server it was landed initially; thereby maintaining the application state.
Therefore I believe, same would happen for your windows based multi threaded app, if you utilize the affinity parameter.
Network Load Balancing Best practices
Web Farming with the
Network Load Balancing Service
in Windows Server 2003 might help you give an insight
Concurrency (Check out Apache Cassandra, et al)
Speed of light issues (if going cross-country or international you'll want heavy use of transactions)
Backups and deduplication (Companies like FalconStor or EMC can help here in a distributed system. I wouldn't underestimate the need for consulting here)

Resources