How to efficiently handle thousands of keep alive connections in Go?

Using golang's net/http server to handle connections, is there a pattern to better handle 10,000 keep alive connections with relatively low requests per second each?
My benchmark performance with a tool like wrk is 50,000 requests per second, but with real traffic (from real-time bidding exchanges) I have a hard time beating 8,000 requests per second.
I know connection multiplexing from a hardware loadbalancer is possible, but it seems like the same type of pattern can be achieved in Go.

You can distribute load on local and remote servers using an IPC protocol like JSON RPC through e.g. UNIX and TCP sockets.
As to the performance bottleneck: it has been discussed extensively on the go-nuts mailing list. At the time of writing, it is the runtime's goroutine scheduler and world-stopping garbage collector.
The core team has recently made major improvements to the runtime to alleviate this problem yet there still is room for improvement. To quote one example:
Due to tighter coupling of the run-time and network libraries, fewer context switches are required on network operations.


I have an Nginx proxy server. When an HTTP/2 request comes to the server and does not find anything in cache, the server makes an outbound request to the origin server using HTTP/1.1. Is there a performance degradation on the server when it converts from one version of the protocol to another? How does this compare to HTTP/1.1 to Nginx and HTTP/1.1 to the origin server? Is there a way to measure the overhead?
Strictly speaking, there is performance degradation, since one protocol is binary and the other is textual. The proxy must convert between them, which takes resources and time, so you can expect some degradation by default.
In general, however, it can be much more complicated. Say your proxy is used over a slow mobile connection. Who cares about a bit of conversion if your app gains a huge boost after that conversion? Or maybe your proxy already did gzip compression for HTTP/1.1, so the speed gain is not that big; on the other hand, maybe the performance degradation is not that big either?
Can you measure that? Perhaps. The question is what for. I would measure something as close to the real case as possible, and I would automate that measurement to see where real performance stands.
My only worry in your case is the CPU of the proxy: measure it to see changes over time, and set up notifications such as "if CPU is over 80% for longer than 5 min".
Other than all of that, HTTP/2 brings two-way communication as well as push. My assumption is that this is not your case, since you are comparing 1.1 and 2. Personally I would go with HTTP/2 everywhere; unfortunately nginx does not support HTTP/2 on the backend side. Fingers crossed to see that soon!
Yes, going from HTTP2 to HTTP 1.1 degrades performance, primarily due to protocol-imposed transport conversions. For example, you lose the following transport optimizations:
Single connection
Request/response multiplexing
Header compression
Additionally, as Michal mentioned, HTTP 1.1 messages are textual while HTTP2 messages are binary.
HTTP2 multiplexes requests and responses over a single connection. However, HTTP 1.1 only affords persistent connections and request/response pipelining, which is not even comparable. For example, pipelining forces a FIFO order of message exchanges, which causes blocking.
To achieve any similar throughput levels, the proxy will have to open a connection pool to each backend. Those pools could be large or small, but consider the resource allocations, TCP handshakes, TLS handshakes, etc., per connection and you start to get the idea of how much overhead we're talking about.
Measure the difference between "throughput in" on cache hits and "throughput out" on cache misses, e.g. "protocol conversion throughput penalty" is ~23 tps. (You should also know your average cache miss penalty in terms of time.)
Key metrics
Throughput in versus throughput out
Average cache miss penalty
Cache hit and cache miss ratios
Unless your cache miss ratio is high, I wouldn't worry about this.
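The key metrics above reduce to a little arithmetic. Here is a rough sketch, assuming you already collect hit/miss counts and the two throughput figures yourself; the Stats type and its field names are hypothetical, and the numbers in main are made up to mirror the "~23 tps" example.

```go
package main

import "fmt"

// Stats holds measurements gathered elsewhere; all names are hypothetical.
type Stats struct {
	Hits, Misses int
	InTPS        float64 // "throughput in", measured on cache hits
	OutTPS       float64 // "throughput out", measured on cache misses
}

// MissRatio is the fraction of requests that had to go to the origin.
func (s Stats) MissRatio() float64 {
	total := s.Hits + s.Misses
	if total == 0 {
		return 0
	}
	return float64(s.Misses) / float64(total)
}

// Penalty is the throughput lost on the cache-miss path, in tps.
func (s Stats) Penalty() float64 {
	return s.InTPS - s.OutTPS
}

func main() {
	s := Stats{Hits: 900, Misses: 100, InTPS: 500, OutTPS: 477}
	fmt.Printf("miss ratio %.2f, penalty %.0f tps\n", s.MissRatio(), s.Penalty())
}
```

If the miss ratio is small, the penalty applies to only a small share of traffic, which is the reasoning behind "I wouldn't worry about this".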
I don't think there is a performance degradation. One way to measure the impact (since I can't test for you, you will have to do it) is to use AJAX: send an HTTP/1.1 request and measure the response time, then compare it to the time it takes to send HTTP/2 requests. Do it multiple times.
That'll help you.
But beware, there may be a potential security problem.
That is, there will be no point in even installing an SSL/TLS certificate. This is because the info that the NGINX server sends will then be open to hackers. Probably.

I am working on client-server software using Microsoft RPC (over TCP) as the communication method. We sometimes transfer files from the client to the server. This works fine in local networks. Unfortunately, when we have a high latency, even a very wide bandwidth does not give a decent transfer speed.
Based on a WireShark log, the RPC layer sends a bunch of fragments, then waits for an ACK from the server before sending more and this causes the latency to dominate the transfer time. I am looking for a way to tell RPC to send more packets before pausing.
The issue seems to be essentially the same as with a too-small TCP window, but there might be an RPC-specific fragment window at work here, since Wireshark does not show the TCP-level window being full. iPerf connection tests with a small window do give those warnings, and a speed similar to the RPC transfer. With larger window sizes, the iPerf transfer is three times faster than the RPC, even with a reasonable (40 ms) latency.
I did find some mentions of an RPC fragment window at Microsoft's site and in an RPC document (search for window_size), but these seem to concern only connectionless (UDP) RPC. Additionally, they mention an RPC "fack" message and I observed only regular TCP-level ACKs in the log.
My conclusion is that either the RPC layer is using a stupidly low TCP window, or it is limiting the number of fragment packages it sends at a time by some internal logic. Either way, I need to make it send more between ACKs. Is there some way to do this?
I could of course just transfer the file over multiple simultaneous connections, but that seems more like a work-around than a solution.
PS. I know RPC is not really designed for file transfer, but this is a legacy application and the RPC pipe deals with authentication and whatnot, so keeping the file transfer there would be best, at least for now.
PPS. I guess that if the answer to this question is a configuration option, this would be better suited for SuperUser, but an API setting would be ideal, which is why I posted this here.
I finally found a way to control this. This Microsoft documentation page: Configuring Computers for RPC over HTTP contains registry settings that set the windows RPC uses, at least when used in conjunction with RPC over HTTP.
The two most relevant settings were:
HKLM\Software\Microsoft\Rpc\ClientReceiveWindow: DWORD
Making this higher (a few MB, specified in bytes) on the client machine made the download to the client much faster.
HKLM\Software\Microsoft\Rpc\InProxyReceiveWindow: DWORD
Making this higher on the server machine made the upload faster.
The downside of these options is that they are global. The first one will affect all RPC clients on the client machine and the latter will affect all RPC over HTTP proxying on the server. This may have serious caveats, but a tenfold speed increase is nothing to be scoffed at, either.
Still, setting these on a per-connection basis would be much better.

I am developing a TCP Proxy to be put in front of a TCP service that should handle between 500 and 1000 active connections from the wild Internet.
The proxy is running on the same machine as the service, and is mostly-transparent. The service is for the most part unaware of the proxy, the only exception being the notification of the real remote IP address of the clients.
This means that, for every inbound open TCP socket, there are two more sockets on the server: the second of the pair in the Proxy, and the one on the real service behind the proxy.
The send and recv window sizes on the two Proxy sockets are set to 1024 bytes.
What are the performance implications of this? How slow is this configuration? Should I put some effort into changing the service to use Named Pipes (or another IPC mechanism), or is a localhost TCP socket for the most part an efficient IPC mechanism?
The merge of the two apps is not an option. Right now we are stuck with the two process configuration.
EDIT: The reason for having two separate processes on the same hardware is 100% economics. We have one server only, and we are not planning on getting more (no money).
The TCP service is legacy software written in Visual Basic 6 which grew beyond our expectations. The proxy is C++. We don't have the time, money or manpower to rewrite and migrate the VB6 code to a modern programming environment.
The proxy is our attempt to mitigate a specific performance issue on the service, a DDoS attack we are getting from time to time.
The proxy is open source, and here is the project source code.
It will be the same (or at least not measurably different). Winsock is smart enough to know if it's talking to a socket on the same host and, in that case, it will short-circuit pretty much everything below IP and copy data directly buffer-to-buffer. In terms of named pipes vs. sockets, if you need to potentially be able to communicate to different machines ever in the future, choose sockets. If you know for a fact that you'll never need to do that, pick whichever one your developers are most familiar or most comfortable with.
For anyone that comes to read this later, I want to add some findings that answer the original question.
For a utility we are developing we have a networking class that can use named pipes, or TCP with the same calls.
Here is a typical loop back file transfer on our test system:
TCP/IP Transfer time: 2.5 Seconds
Named Pipes Transfer time: 3.1 Seconds
Now, if you go outside the machine and connect to a remote computer on your network the performance for named pipes is much worse:
TCP/IP Transfer time: 12 Seconds
Named Pipes Transfer time: 2.5 Minutes (Yes Minutes!)
I realize that this is just one system (Windows 7) But I think it is a good indicator of how slow named pipes can be...and it seems like TCP is the way to go.
I know this topic is very old, but it was still relevant for me, and maybe others will look at this in the future as well.
I implemented IPC between Excel (VBA) and another process on the same machine, both via a TCP connection as well as via Named Pipes.
In a quick performance test, I submitted a message that consisted of 26 bytes from client (Excel) to server (not Excel), and waited for the reply message from the other process (which consisted of 12 bytes in the example).
I executed this a ton of times in a loop and measured the average execution time.
With TCP on localhost (Windows 7, no fastpath), one "conversation" (request+reply) took around 300-350 microseconds. Sending data in particular was quite slow (sending the 26 bytes took around 200 microseconds via TCP).
With Named Pipes, one conversation took around 60 microseconds on average - so a LOT faster.
I'm not entirely sure why the difference was so large. The corporate environment I tested this in has a strict firewall, packet inspection and whatnot, so I THINK this may have been caused by even the localhost-based TCP connection going through security measures that slowed it down significantly, while the named pipe ones likely did not.
TL;DR: In my case, Named Pipes were around 5-6 times faster than TCP for small packets (I have not tested with bigger ones yet).
Let me sum it up for you. If you are worried about performance then use TCP/IP. But if you have a really fast network and you're not worried about performance then Named Pipes would be "neat" in that it might save you some code.
Not to mention, if you stick to TCP then you will have something that can be scaled, and even load balanced when the time comes.
In the scenario you describe, the local TCP connections are very unlikely to be a bottleneck. It will introduce some overhead, of course, but this should be negligible unless your CPU is already running hot.
At a guess, if your server's CPU usage is normally below 50% or so (with the proxy in place) it isn't worth worrying about minimizing the overhead associated with the local TCP connections.
If CPU usage is regularly above 80% you should probably be doing some profiling. I'd start by comparing the CPU load (or, better still, the performance, if you can measure it meaningfully) when the proxy is in place to when it isn't. Unless the proxy is doing some complicated processing, the overhead associated with the extra TCP connections is probably a significant fraction of the total overhead introduced by the proxy, so that should give you at least an order-of-magnitude estimate of how much you'd gain by using a more efficient form of IPC.
What is the reason to have a proxy on the SAME machine, just curious?
There are several methods for IPC, TCP/IP, named Pipes are comparable in speed and complexity. If you really want something that scales well and has almost no overhead: use shared memory. Best used in combination with a lock free algorithm for advancing the pointers (or use one buffer for each reader (the proxy/the service) and writer(the service/the proxy)).
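To illustrate the "lock free algorithm for advancing the pointers" idea, here is a single-producer, single-consumer ring buffer sketch in Go. Note the assumption: it runs between two goroutines in one process. Real cross-process shared memory would additionally need a memory region mapped into both processes, which is not shown. The Ring type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// Ring is a fixed-capacity queue for exactly one producer and one consumer.
// head and tail only ever grow; the slot index is the counter modulo capacity.
type Ring struct {
	head uint64 // next slot to read; written only by the consumer
	tail uint64 // next slot to write; written only by the producer
	buf  []int
}

func NewRing(capacity int) *Ring {
	return &Ring{buf: make([]int, capacity)}
}

// Push stores v and returns true, or returns false if the ring is full.
func (r *Ring) Push(v int) bool {
	tail := atomic.LoadUint64(&r.tail)
	head := atomic.LoadUint64(&r.head)
	if tail-head == uint64(len(r.buf)) {
		return false // full
	}
	r.buf[tail%uint64(len(r.buf))] = v
	atomic.StoreUint64(&r.tail, tail+1) // publish only after the slot is written
	return true
}

// Pop removes the oldest value, or returns false if the ring is empty.
func (r *Ring) Pop() (int, bool) {
	head := atomic.LoadUint64(&r.head)
	tail := atomic.LoadUint64(&r.tail)
	if head == tail {
		return 0, false // empty
	}
	v := r.buf[head%uint64(len(r.buf))]
	atomic.StoreUint64(&r.head, head+1) // release the slot to the producer
	return v, true
}

func main() {
	r := NewRing(64)
	const n = 1000
	go func() {
		for i := 1; i <= n; i++ {
			for !r.Push(i) {
				runtime.Gosched() // ring full: let the consumer run
			}
		}
	}()
	sum := 0
	for got := 0; got < n; {
		if v, ok := r.Pop(); ok {
			sum += v
			got++
		} else {
			runtime.Gosched() // ring empty: let the producer run
		}
	}
	fmt.Println("sum:", sum)
}
```

With one writer per index, neither side ever needs a lock; for two-way traffic you would use one such buffer per direction, as suggested above.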

I am trying to run some MSMQ performance test on a Win2008r2 server. Ideally, I would like to simulate several thousands of workstations sending (each of them) 5 msgs/sec.
One way to do so is to work with Amazon but I am wondering if this could be done in other ways.
I thought that using a custom tool which sends a large number of msgs from a single workstation could do the job, but it seems that there are some internal mechanisms which prevent a true real-life representation of what I am trying to simulate. I can send 2000 msgs/sec from a workstation, but because of the OUTGOING queue and other mechanisms, the server seems to swallow the whole thing in large chunks of data (and I am noticing only, at best, 50 msgs/sec peaks).
I believe there must be some kind of overhead operation on the workstation before sending data to the server, which I lose by simulating only a single workstation (or a few more).
Any ideas?
P.S. I am using a private transactional queue. Win7 on workstation. MSMQ 5
Simulating throughput is easy but it is incredibly hard to simulate multiple MSMQ clients. Each client has a unique IP address and its own client-queue-manager-to-server-queue-manager network connection. Using Amazon to generate a large number of instances of a Windows client would do the trick but I haven't seen any solution that works on standard PCs.
The overhead you can't simulate with just sending lots of messages is the kernel memory used by the network connection and the threads used to handle incoming messages. Network connections are very expensive and eventually the server will fail if you have too many simultaneously connected clients.
As you are continuously sending messages, each client will have a persistent connection to the server. This is good for speed but bad for memory/thread usage. 5,000 clients will require a powerful 64-bit server.
So, what's the limit on connections to an MSMQ server?
"What are MSMQ's limits?" If I had a farthing for every time...
Insufficient Resources? Run away, run away!
FIX: Kernel-pool memory may become exhausted when many clients connect to Message Queuing

I've to move a Windows based multi-threaded application (which uses global variables as well as an RDBMS for storage) to an NLB (i.e., network load balancer) cluster. The common architectural issues that immediately come to mind are
Global variables (which are both read/ written) will have to be moved to a shared storage. What are the best practices here? Is there anything available in Windows Clustering API to manage such things?
My application uses sockets, and persistent connections are the norm in the field I work in. I believe persistent connections cannot be load balanced. Again, what are the architectural recommendations in this regard?
I'll answer the persistent connection part of the question first since it's easier. All good network load-balancing solutions (including Microsoft's NLB service built into Windows Server, but also including load balancing devices like F5 BigIP) have the ability to "stick" individual connections from clients to particular cluster nodes for the duration of the connection. In Microsoft's NLB this is called "Single Affinity", while other load balancers call it "Sticky Sessions". Sometimes there are caveats (for example, Microsoft's NLB will break connections if a new member is added to the cluster, although a single connection is never moved from one host to another).
re: global variables, they are the bane of load-balanced systems. Most designers of load-balanced apps will do a lot of re-architecture to minimize dependence on shared state since it impedes the scalability and availability of a load-balanced application. Most of these approaches come down to a two-step strategy: first, move shared state to a highly-available location, and second, change the app to minimize the number of times that shared state must be accessed.
Most clustered apps I've seen will store shared state (even shared, volatile state like global variables) in an RDBMS. This is mostly out of convenience. You can also use an in-memory database for maximum performance. But the simplicity of using an RDBMS for all shared state (transient and durable), plus the use of existing database tools for high-availability, tends to work out for many services. Perf of an RDBMS is of course orders of magnitude slower than global variables in memory, but if shared state is small you'll be reading out of the RDBMS's cache anyways, and if you're making a network hop to read/write the data the difference is relatively less. You can also make a big difference by optimizing your database schema for fast reading/writing, for example by removing unneeded indexes and using NOLOCK for all read queries where exact, up-to-the-millisecond accuracy is not required.
I'm not saying an RDBMS will always be the best solution for shared state, only that improving shared-state access times is usually not the way that load-balanced apps get their performance; instead, they get performance by removing the need to synchronously access (and, especially, write to) shared state on every request. That's the second thing I noted above: changing your app to reduce dependence on shared state.
For example, for simple "counters" and similar metrics, apps will often queue up their updates and have a single thread in charge of updating shared state asynchronously from the queue.
For more complex cases, apps may switch from Pessimistic Concurrency (checking that a resource is available beforehand) to Optimistic Concurrency (assuming it's available, and then backing out the work later if you ended up, for example, selling the same item to two different clients!).
Net-net, in load-balanced situations, brute force solutions often don't work as well as thinking creatively about your dependency on shared state and coming up with inventive ways to prevent having to wait for synchronous reading or writing shared state on every request.
I would not bother with using MSCS (Microsoft Cluster Service) in your scenario. MSCS is a failover solution, meaning it's good at keeping a one-server app highly available even if one of the cluster nodes goes down, but you won't get the scalability and simplicity you'll get from a true load-balanced service. I suspect MSCS does have ways to share state (on a shared disk) but they require setting up an MSCS cluster which involves setting up failover, using a shared disk, and other complexity which isn't appropriate for most load-balanced apps. You're better off using a database or a specialized in-memory solution to store your shared state.
Regarding persistent connections, look into the port rules, because port rules determine which TCP/IP port is handled and how.
When a port rule uses multiple-host load balancing, one of three client affinity modes is selected. When no client affinity mode is selected, Network Load Balancing load-balances client traffic from one IP address and different source ports on multiple-cluster hosts. This maximizes the granularity of load balancing and minimizes response time to clients. To assist in managing client sessions, the default single-client affinity mode load-balances all network traffic from a given client's IP address on a single-cluster host. The class C affinity mode further constrains this to load-balance all client traffic from a single class C address space.
What allows an app's session state to persist is enabling the client affinity parameter: NLB then directs all TCP connections from one client IP address to the same cluster host, which allows session state to be maintained in host memory.
The client affinity parameter makes sure that a connection is always routed to the server it initially landed on, thereby maintaining the application state.
Therefore I believe the same would happen for your Windows-based multi-threaded app if you utilize the affinity parameter.
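The idea behind single-client affinity can be sketched as "hash the client IP, ignore the source port". This is only an illustration of the principle: the FNV hash and the pickHost function are my own stand-ins, not NLB's actual algorithm.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net"
)

// pickHost maps a client address of the form "ip:port" to a host index in
// [0, hosts). Only the IP is hashed, so every connection from the same
// client lands on the same host regardless of its source port.
func pickHost(clientAddr string, hosts int) (int, error) {
	if hosts <= 0 {
		return 0, fmt.Errorf("hosts must be positive, got %d", hosts)
	}
	ip, _, err := net.SplitHostPort(clientAddr)
	if err != nil {
		return 0, err
	}
	h := fnv.New32a()
	h.Write([]byte(ip))
	return int(h.Sum32() % uint32(hosts)), nil
}

func main() {
	for _, addr := range []string{"203.0.113.7:5000", "203.0.113.7:6000", "198.51.100.9:5000"} {
		host, err := pickHost(addr, 4)
		fmt.Println(addr, "->", host, err)
	}
}
```

Hashing the source port as well would correspond to the "no affinity" mode described in the quote, where connections from one IP address are spread across hosts.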
Network Load Balancing Best practices
Web Farming with the Network Load Balancing Service in Windows Server 2003 might help give you some insight.
Concurrency (Check out Apache Cassandra, et al)
Speed of light issues (if going cross-country or international you'll want heavy use of transactions)
Backups and deduplication (Companies like FalconStor or EMC can help here in a distributed system. I wouldn't underestimate the need for consulting here)
