Resources and tools for TCP tuning - windows

I'm developing an application where, to satisfy the performance requirements, tuning of low-level network stuff (such as TCP window size etc) seems to be required.
I found the magnitude of my knowledge to be a bit better than "there's TCP and there's UDP", which is far from enough for this task.
What resources might I study to get a better knowledge of which aspects of TCP influence which performance characteristics in which usage scenarios (for example, how to decrease latency while transmitting 100kb packets to 1000 clients simultaneously on a 10gbit LAN), etc.?
What tools might help me? (I already know about Wireshark, but most probably I am not using it to its full potential)

First understand what you're doing:
Then understand how to look at what you're about to change:
Then understand how to change things:
Since you're currently on a Windows platform and you mention that you're "developing an application", you probably also want to know about I/O Completion Ports and how you maximise data flow whilst conserving server resources using write-completion-driven flow control. I've written about these quite a bit, and the last link is a link to my free high performance C++ client/server framework, which may give you some pointers as to how to use IOCP efficiently.
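As a rough illustration of the per-socket settings the question alludes to (TCP window size etc.), here is a minimal sketch using Python's standard socket module rather than Winsock/IOCP. The function names and the buffer sizes are made up for the example and are not recommendations; the OS is free to round or cap whatever you request, which is why the sketch reads the values back.

```python
import socket


def tuned_tcp_socket(sndbuf=256 * 1024, rcvbuf=256 * 1024, nodelay=True):
    """Create a TCP socket with explicit buffer sizes and Nagle disabled.

    The default sizes are illustrative placeholders, not tuning advice.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request larger send/receive buffers; the OS may round or cap these.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    if nodelay:
        # Disable Nagle's algorithm so small writes are not held back.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def effective_options(sock):
    """Read back what the OS actually applied to the socket."""
    return {
        "SO_SNDBUF": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "SO_RCVBUF": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        "TCP_NODELAY": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY),
    }
```

Comparing `effective_options()` before and after a change, alongside a Wireshark capture, is one way to check that a setting really took effect before measuring its impact.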

Microservices and low latency transport

I know of only two popular transport protocols in the microservices world: REST/HTTP and AMQP.
And I sense two problems with that:
Don't you think they are pretty slow? If you disagree with that claim (admittedly I have no benchmark for AMQP, although HTTP is widely considered slow, and you can find articles about that on the internet without my help), then consider that with a scarce choice of 2 it is easy to imagine that many faster protocols are simply not represented. 2 is a very small number, which in practice means no choice.
HTTP does not look like it was intended to be a server-to-server protocol, yet it is widely used in this role.
What do you think about that, and can you suggest some alternative (supported by frameworks; I mean something that I do not need to write from scratch myself)?
It all depends on your domain scenario, its requirements and how much you can invest into the development for a lower latency, smaller bandwidth, etc.
Today there is a whole spectrum of options for server communication. HTTPS just happens to be the most common one and good enough for a lot of applications.
Given you have both ends of the communication under control, nothing prevents you from investing more effort and building your own binary protocol based on a UDP socket, or going even lower in the OSI layers. For example, Google is using QUIC and has proposed to make it a successor to HTTP/2, so HTTP/3 may actually become a lot more efficient.
Or you can try to implement existing standards that are more optimized for latency and even real-time applications. One example from the industrial domain is PROFINET.
A lot of times the payloads are what creates slow connections, though. JSON is a great example of a format that takes a lot of time to serialize and deserialize in large quantities. To improve that you can use a different transport format, for example FlatBuffers (another Google invention) from the gaming domain.
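To make the payload point concrete, here is a small sketch comparing JSON with a fixed-layout binary encoding. It uses Python's standard struct module as a stand-in for a binary format; it is not FlatBuffers, and the record layout (sensor id, timestamp, value) is a hypothetical example.

```python
import json
import struct

# Hypothetical record layout: uint32 id, float64 timestamp, float64 value,
# little-endian with no padding (20 bytes per record).
RECORD = struct.Struct("<Idd")


def encode_json(samples):
    """Serialize the samples as a JSON array of objects."""
    return json.dumps(
        [{"id": i, "ts": ts, "value": v} for i, ts, v in samples]
    ).encode("utf-8")


def encode_binary(samples):
    """Serialize the samples as back-to-back fixed-size records."""
    return b"".join(RECORD.pack(i, ts, v) for i, ts, v in samples)


def decode_binary(payload):
    """Unpack every fixed-size record from the payload."""
    return [
        RECORD.unpack_from(payload, offset)
        for offset in range(0, len(payload), RECORD.size)
    ]


samples = [(n, 1700000000.0 + n, n * 0.5) for n in range(1000)]
json_bytes = encode_json(samples)
bin_bytes = encode_binary(samples)
print(len(json_bytes), len(bin_bytes))
```

The binary form is smaller and needs no text parsing, at the cost of both ends having to agree on the layout up front, which is the trade-off schema-based formats are designed to manage.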
In general, if you do some research into how networking is done in gaming, you will find a lot of interesting technologies.
First, please separate architectural topics from implementation topics. One side is architecture and the other side is implementation. Microservices architecture is a new paradigm within SOA. In the implementation phase, you can use several protocols to implement your microservice-sized service. You can use UDP, TCP, HTTP, etc.
HTTP is widely used in microservices where there are certain concerns such as statelessness, but this does not necessarily mean that all microservices need to use HTTP in the implementation phase. They could use HTTP or any other protocol such as UDP, or even CoAP.

Bypassing the TCP-IP stack

I realise this is a somewhat open ended question...
In the context of low latency applications I've heard references to by-passing the TCP-IP stack.
What does this really mean, and, assuming you have two processes on a network that need to exchange messages, what are the various options (and associated trade-offs) for doing so?
Typically the first steps are using a TCP offload engine (TOE) or a user-space TCP/IP stack such as OpenOnload.
Completely skipping TCP/IP usually means looking at InfiniBand and using RDMA verbs, or even implementing custom protocols above raw Ethernet.
Generally you incur latency from anything that runs in the kernel, so user-space mechanisms are ideal. Beyond that, the TCP/IP stack is an overhead in itself when you consider all of the layers and the complex ways in which it can be arranged: IP families, sub-networking, VLANs, IPSEC, etc.
This is not a direct answer to your question, but I thought it might give you another view on this topic.
Before trying to bypass TCP-IP stack I would suggest researching proven real-time communication middleware.
One good solution for real-time communication is the Data Distribution Service (DDS) from the OMG (Object Management Group).
DDS offers 12 or so quality attributes and has bindings for various languages.
It has LATENCY_BUDGET, TRANSPORT_PRIORITY and many other quality of service attributes that make data distribution very easy and fast.
Check out an implementation of the DDS standard by PrismTech. It is called OpenSplice and works well at LAN scale.
Depends on the nature of your protocol really.
If by low-latency applications you mean electronic trading systems, then they normally use IP or UDP multicast for market data, such as Pragmatic General Multicast. This is mostly because there is one sender and many receivers of the data, so using TCP would require sending a copy of the data to each recipient individually, requiring more bandwidth and increasing latency.
Trading connections traditionally use TCP with application-level heartbeats because the connection needs to be reliable and connection loss must be detected promptly.
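As a minimal sketch of what application-level heartbeats involve, the class below tracks when each side last spoke and decides when to send a heartbeat and when to declare the peer dead. The class name, the one-second interval and the "three missed intervals" rule are assumptions for illustration, not taken from any particular trading protocol.

```python
import time


class HeartbeatMonitor:
    """Tracks peer liveness for a connection using application-level heartbeats."""

    def __init__(self, interval=1.0, missed_allowed=3, clock=time.monotonic):
        self.interval = interval
        # The peer is considered dead after this many seconds of silence.
        self.timeout = interval * missed_allowed
        self._clock = clock
        now = clock()
        self._last_received = now
        self._last_sent = now

    def on_message_received(self):
        # Any inbound traffic, heartbeat or not, proves the peer is alive.
        self._last_received = self._clock()

    def on_message_sent(self):
        self._last_sent = self._clock()

    def should_send_heartbeat(self):
        # Only send a heartbeat when we have been idle for a full interval.
        return self._clock() - self._last_sent >= self.interval

    def peer_is_dead(self):
        return self._clock() - self._last_received > self.timeout
```

An event loop would call `should_send_heartbeat()` and `peer_is_dead()` on a timer and tear the connection down on the latter, which detects a silent failure far sooner than waiting for TCP's own timeouts.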

The Application Split Challenge - fast+easy RPC technology?

The following tries to get an idea of which technologies would be suitable for a specific (as outlined) distributed/RPC problem. If something is not clear, I am very happy to add more details, but please request these in a comment and not in an "answer". Thanks.
First I will describe the current situation, and then follows what we want to achieve and the actual question. Despite this being a rather long post to get some context, the question itself is rather short (see at the end).
The Application Split challenge
Application description:
The app allows the user to configure a number of hardware devices(*)
and then communicate with these to control and collect measurement
channels of a physical experiment.
(*) Hardware devices include temperature sensors, pressure sensors,
motors, ... Communication ranges from serial port communication,
TCP/UDP communication to interfacing with the drivers of 3rd party
plugin cards.
Control involves sending commands to the various hardware devices
to configure them according to the protocols they support.
Measuring involves getting the data from (some of) these devices.
We are hard pressed to keep the whole thing running as customers
demand more and more channels at higher sample rates and we have to
keep up with writing the data+timestamps we get from all devices to
disk, display a subset of the data and still keep the system
responding properly.
Current situation:
[ DisplayAndControl.exe ]
   ||                    /\
   || DLL Interface      ||
   ||                    || Window Messages (SendMessage, PostMessage)
   ||                    ||
   \/                    ||
[ ChannelManager.dll ]
ChannelManager.dll (Native C++ DLL on Windows)
Manages n data channels (physical measurement variables)
Each channel holds a shifting arbitrary number of samples with
high-precision timestamps
Allows to group channels and write their ongoing updates or
historical values ("measurement") to disk
Calculations with channels (arithmetic, integration, mean
values, etc.)
Interfaces with (realtime) hardware devices to get the timestamps
and values of channels
Get value+timestamp from hardware and save in internal
ring buffer for channel
DisplayAndControl.exe (Native C++ MFC App on Windows)
Control the functions of ChannelManager.dll (configure channels
and HW devices)
Live display current values/timestamps/changes of all channels
Graph values of (groups of) channels in diagrams
print diagrams and tables of channel values
Summary of current situation:
The application as it is at the moment is already somewhat modular
in that the (main) executable does the display+interaction and the
(one of several) DLL does the data management (saving of live data
to disk, communication with devices, etc.)
From a performance POV, communication btw. the display module and
the data management module is optimally performant at the moment.
New situation:
[ DisplayAndControl.exe ]
   ||                    /\
   || ? RPC/Messaging    ||
   ||                    || ? RPC/Messaging
   ||                    ||
   \/                    ||
[ ChannelManager.exe (same PC or another) ]
Summary of the envisioned new situation:
For usability, performance and safety reasons, we wish to split up
this Windows app into two separate applications, so that the
performance (and safety) sensitive ChannelManager module can run as
a separate process possibly on a separate Windows PC.
Additionally, since we're already going to split this, we will
allow for multiple DisplayAndControl.exe apps connected to one
single ChannelManager.exe.
One QUESTION now is what technology we should use to facilitate the
communication btw. the now two (or, rather, 1 : small_n) applications.
Performance is important, because a lot of data travels btw. the
two applications and latency should be kept to a minimum. It "only"
needs to work on Windows, but it should be usable from native C++
only which makes all purely .NET based technologies unattractive.
(Note: Porting parts of DisplayAndControl.exe to .NET/WPF is
planned, but ChannelManager.exe should stay pure native, as we
don't want any .NET stuff running inside this process.)
Regarding latency: It is important that we achieve some level of
soft-realtime in the sense that small latency is acceptable, but
large and especially varying latency is not acceptable for usability
and safety reasons. Therefore any protocol that would help in
getting some sort of (soft) realtime behavior would be preferred.
RPC technologies we've looked at:
WCF (or .NET remoting) - Is dotnet only, therefore not
attractive. Performance figures are also not very good.
(D)COM - COM is great for Windows RPC communication, but it
breaks down once you have to have inter-PC comm because it is
horrible to get the security settings working in a corporate IT
CORBA - We have had good experience with CORBA communications in
the past. The communication is easy to get working; there's not
much infrastructure overhead; it works well from C++; writing
a .NET wrapper is pretty trivial. The problem with CORBA is that
it's somewhat complicated to use correctly in C++ (people will use
a lot of time on chasing memory leaks, esp. inexperienced C++
devs). It also will be a learning curve for every developer and
every new developer, as no one expects people to "know" CORBA
nowadays. Also, it might not perform as well as we'd like it to and
as far as I know there's no readily available realtime support.
Thrift - still looks half-baked to use in our scenario.
ICE (from ZeroC) - I would prefer ICE over CORBA anytime, after all
it promises to be a "better CORBA" and I think it does deliver on
that. However, their licensing policy is very suboptimal as they do not sell development licenses but only license per
installation. (Well that's what they told us last time we asked end of 2009.) Their licensing policy also suggests that any 3rd party possibly interested in interfacing with our modules would first have to negotiate a license contract with ZeroC too.
Open MPI - Message Passing interface seems to be targeted at
scenarios with lots of clients "heavily" distributed. Doesn't seem
to fit our problem.
Writing our own communication layer using TCP/UDP - Oh my. I'd
rather not :-)
Google Protocol Buffers - Is not an RPC technology.
Distributed Shared Memory - Well. This got thrown in by a few
devs and I for one am neither sure if there's a working
implementation nor if it fits our problem.
So again the QUESTION - what "RPC"-like technology would you prefer
in this situation and why?
I can elaborate on Johnny's answer. CORBA provides a robust infrastructure with services that go far beyond simple RPC. As your distributed application grows, you can use CORBA features to manage the mapping between interface and implementation, to provide secure connections, etc. As an RPC, CORBA provides the means for easy synchronous or asynchronous invocations.
The learning curve isn't that steep either. While some of the terms are a little arcane, the concepts such as managed (counted) references should be familiar to today's C++ programmers. And when the C++0x mapping is available, it will be even easier. Training is available to help make this transition even easier.
You mentioned not knowing about realtime support. In fact, CORBA for C++ has rich RT support. There is an RT CORBA specification and several C++ ORBs that implement it. TAO, which is open source and commercially supported, has extensive RT support, including the RT_ORB, RT_POA, and a TAO-specific RT Event Service. With these tools you are able to designate priority levels for threads in the ORB, and have separate communication channels for different priority levels.
I'd suggest taking a look at Thrift. While it looks half-baked, I believe it's only the documentation that's lacking - the implementation is quite solid.
CORBA should perform well and there are people with experience. We realize that the IDL to C++ mapping is hard to use; there is an RFP from the OMG asking for a new IDL to C++0x mapping, which should make it much easier to use.

Is Perl the fastest way to write a high performance page?

I was inspired by Slashdot; I heard that it uses very limited servers to support a lot of users with fast response. There is also a website named slashcode, though I am not sure whether Slashdot uses its source code.
I am wondering whether Perl is the best choice for writing a high performance web page. I gather that using Apache or IIS adds a lot of overhead?
Any ideas, books, papers, tutorials?
I'm going to assume that by "high performance" you mean both in the real time taken to produce a page and also how many it can serve concurrently.
The programming language isn't so important as your servers and algorithms. You may want to look into the C10k problem, which has driven a series of new technologies and refinements of techniques with the aim of allowing a single web server to handle more than 10,000 concurrent connections. Things like the Nginx and lighttpd web servers and the Varnish cache came out of this effort.
Big wins come from using a very light, very fast, very modular web server (Apache and IIS ain't it) with a very light, very fast cache in front of it to avoid having to process the same thing twice. For a high concurrency server, even caching for a few seconds can save you hundreds or thousands of processes. By chopping up a static page into a series of AJAX requests you can cache the more static bits and pieces independently of the bits that change frequently.
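To illustrate the "caching for a few seconds" idea, here is a minimal in-process sketch. In practice this job is usually done by a dedicated cache in front of the web server; the `MicroCache` class, its two-second default and the `render` callback are hypothetical names chosen for the example.

```python
import time


class MicroCache:
    """Caches rendered pages for a few seconds, keyed by URL path."""

    def __init__(self, ttl_seconds=2.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # path -> (expires_at, body)

    def get_or_render(self, path, render):
        """Return the cached body for path, calling render(path) only when stale."""
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive render
        body = render(path)
        self._entries[path] = (now + self.ttl, body)
        return body
```

With a two-second lifetime, a page requested 500 times per second is rendered roughly once every two seconds instead of 1,000 times, which is where the saving in processes comes from.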
Instead of using mod_blah that embeds your program into a web server, use FastCGI or similar that puts your programs into their own little application servers. This allows them to run independent of the web server, possibly on remote machines and with load balancing. This lets you easily scale your processing power.
Eventually you're going to micro-optimize really important bits of your application code to the point where the language matters, but you can focus on the really important bits rather than having to do the whole project solely according to raw performance.
Regardless of how fast your code is, at some point the bottleneck will stop being your code, and start being the web server itself.
As long as you're not using the CGI interface[1] to talk to the web server, the language isn't going to have a noticeable impact on performance in 99% of cases. The exceptions are those in which you're doing heavy back-end processing rather than simply grabbing something out of a database, lightly massaging it, and sending it off to the user - and, if you are doing that kind of thing, you're likely better off doing it asynchronously if possible and stuffing the results into a database to be lightly massaged and viewed later.
The reason is, quite simply, that network connection and data transfer times will be so much longer than your program's execution time that it's not even funny. If it's taking 2 seconds to establish a network connection to the server and do the data transmission in each direction, nobody is going to care whether the processing on the server adds 0.1s or 0.2s on top of that 2s of network activity.
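The arithmetic behind that claim can be spelled out in a few lines. This is only a worked version of the figures in the paragraph above (2 s of network time, 0.1 s versus 0.2 s of processing); the function name is made up for the example.

```python
def relative_slowdown(network_s, fast_s, slow_s):
    """Fraction by which total response time grows when processing is slower."""
    fast_total = network_s + fast_s
    slow_total = network_s + slow_s
    return (slow_total - fast_total) / fast_total


# 2 s of network activity, 0.1 s vs 0.2 s of server-side processing.
print(f"{relative_slowdown(2.0, 0.1, 0.2):.1%}")
```

Doubling the processing time moves the total from 2.1 s to 2.2 s, a change of under 5% from the user's point of view.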
[1] Note that I am talking here about the vanilla CGI "start up a new process to service each incoming request" model, not the Perl CGI module (CGI). There are ways to use CGI while also making use of a long-lived process which handles multiple requests over its lifetime.
Architecture and system design are more important than language choice for a high traffic app.
But selecting a language is not the first thing you should do, unless you are planning to write everything from the ground up.
You should be selecting a toolset.
If you want to have something soonest, look at existing web applications. What meets your needs? How customizable is it? Does it meet your performance/scalability requirements? If so, the language you use will be the language your app uses.
If you can't find a good match in existing apps, look at different frameworks, Catalyst, Rails, Squatting, Camping, Jifty, Django. There's a nice list of them on Wikipedia.
You should be able to find a framework that will do the job, many of them. Pick some contenders and choose one. The language you use will be the language your framework uses.
There's really no such thing as a "high performance page". That's like asking what the fastest car is (and if you watch enough Top Gear, you know that's not a simple answer). You have to think about what you actually want to do (i.e. the particular task), what you have to do to make that happen, and which tools would work best for that.
Are you going to have a lot of people doing a lot of small things, or fewer people doing really big things? Is it all going to happen at once (i.e. spikes), or is it going to be constant demand? Are you sending back small chunks of data or serving up really large files?
Suppose that every portion were as fast as possible. It's a fantasy for sure, but consider it anyway. Now that everything is fast as possible, rank every part according to how relatively fast they are. What's the slowest part? Is it disk access? Network IO? Socket availability?
If you aren't at the point where you're already thinking about this, the language probably isn't that important beyond your skill with it.
There are a lot of books on web performance out there. :)
This post on serverfault suggests that you could write an extension module for nginx to serve dynamic content.
Such modules need to be compiled to native machine code, so they are most likely faster than running Perl.
I don't believe it would be faster than other common choices such as PHP, Python, Ruby, Java, or C#.

How do CPG of Corosync, ZeroMQ, and Spread compare for messaging?

I'm interested in:
Resource usage (CPU, memory, ...)
High availability
No single point of failure
Transport options
Routing options
Active development
Widely used
Helpful mailing list, forum, IRC channel, ...
Ease of integration with my current codebase
Gotchas maybe
Any other thing you think I omitted
I've read about them, but I couldn't find a good comparison. I'm especially interested in performance benchmarks comparing them. (Maybe I should do one on my own! I hope not.)
Well, I haven't used the other two, but I can share my experiences with ZeroMQ. In my opinion, it excels at all of your criteria.
Speed and throughput
It's as fast as TCP and doesn't use much CPU or memory. It can push A LOT of messages very quickly without breaking a sweat. It will saturate your network channel way before you run out of memory (I doubt you'll ever be able to max out the CPU). There was a comparison to RabbitMQ somewhere, and ZMQ outperforms it by a factor of 2. From things I've read around the web, it's in use in high speed trading.
RabbitMQ is also a very good tool. Have a look at it - it might be a good fit for what you are looking for.
If you design your application properly, then you can have no single point of failure. It's very easy to connect two sockets to another one, so if one of them fails, the other is there to handle the work. There are things like high water marks to help you along the way. Read the ZeroMQ Guide to learn how to design your app without a SPOF.
Transports and routing
Regarding transport options (if I'm understanding this correctly) - it's up to you to define your protocol. ZeroMQ basically promises you that it will deliver this blob of data to the other end. Use JSON, Protocol buffers, Morse code, whatever you like.
There is no built-in routing like there is in AMQP. Again, it's up to you to specify which ZeroMQ socket connects to which, but this is very easy.
I've been developing with it for a few months (using Python) and haven't found a single issue with its stability. Even when I try to use it the wrong way, it just throws a nice error telling me not to do that. Even restarting/killing some of the services and bringing them back up doesn't cause any problems. I'd say it's a very stable piece of software.
As a note: always use the latest version - the 2.1 version is very much stability oriented, so many stability issues are resolved in it.
Bindings for more than 20 languages, active mailing list, very good documentation, frequent releases. Anything else?
Because it's designed as a library, it's up to you to design your application (unlike the case with a framework), and it pretty much stays out of your way. It feels a bit like a normal TCP socket, but much more powerful and easier to use (it guarantees that a message will be delivered as a whole, not only the first 128 bytes and the rest later, as is the case with regular sockets).
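For comparison, this is roughly the bookkeeping you have to do yourself on a plain stream socket to get whole-message delivery. It is a minimal sketch using a 4-byte length prefix and Python's standard socket module; it is not ZeroMQ's wire format, and the function names are made up for the example.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian length prefix


def send_message(sock, payload):
    """Send one message as a length prefix followed by the payload bytes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Keep reading until exactly n bytes have arrived."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Receive one complete message, however the stream was fragmented."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

A single `recv()` on a TCP socket may return any prefix of what was sent, which is why the loop in `recv_exact()` is needed; ZeroMQ hides this kind of framing behind its send and receive calls.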
There are some, but they are all documented in the guide. (For example, you might miss the first few messages from a PUB socket when you connect a SUB socket to it. There is an explanation of this in the guide and a recipe for how to handle it.)
I find this one of the best designed pieces of software - stable, well written, well documented and doesn't stand in my way.
I recommend that you read the guide end-to-end. It's well written, has examples in a lot of languages (including C++), and describes a lot of edge cases and pain points.
