ZMQ throughput optimization - c++11

I developed an application whose ZMQ message sizes vary widely. On average they are ~177 bytes, but in reality most messages are very small (< 20 bytes) and just a few messages are very large (> 3000 bytes).
Now the network is the limiting factor (1 Gbit Ethernet). I can reach ~50 MByte/s. Another benchmark told me that the network throughput can reach ~85 MByte/s with a packet size of > 256 bytes.
I think my results are this low because most packets are very small. Am I right? Is there a way to optimize ZMQ so that my application can also use the whole bandwidth? Extended batching, for example?
Regards

The ZeroMQ guide illustrates the Black Box Pattern for high-speed subscribers. In essence, it uses a two-stream approach (per node), where each stream has its own I/O thread and subscriber, both of which are bound to a specific network interface (NIC) and core, so you'll need two network adapters and multiple cores per node for this to work. You can read the full details in the guide.
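For a rough idea of what that looks like in code, here is a minimal C++11 sketch using the libzmq C API: two SUB sockets, each steered to its own I/O thread via ZMQ_AFFINITY and drained by its own worker thread. The endpoint addresses are placeholders, and the pinning of worker threads to cores (e.g. with pthread_setaffinity_np) is omitted for brevity.

    // Sketch of the two-stream subscriber idea (libzmq C API, C++11).
    #include <zmq.h>
    #include <cstdint>
    #include <thread>

    static void *make_sub(void *ctx, const char *endpoint, uint64_t io_thread_mask) {
        void *sub = zmq_socket(ctx, ZMQ_SUB);
        // ZMQ_AFFINITY is a bitmask selecting which I/O thread(s) serve this socket.
        zmq_setsockopt(sub, ZMQ_AFFINITY, &io_thread_mask, sizeof io_thread_mask);
        zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);      // subscribe to everything
        zmq_connect(sub, endpoint);
        return sub;
    }

    static void drain(void *sub) {
        char buf[4096];
        while (true) {
            int n = zmq_recv(sub, buf, sizeof buf, 0);  // blocking receive
            if (n >= 0) { /* hand the message to the processing stage */ }
        }
    }

    int main() {
        void *ctx = zmq_ctx_new();
        zmq_ctx_set(ctx, ZMQ_IO_THREADS, 2);            // one I/O thread per stream

        // One stream per NIC, each bound to its own I/O thread.
        void *sub_a = make_sub(ctx, "tcp://192.168.1.10:5556", 1);  // bit 0 -> I/O thread 0
        void *sub_b = make_sub(ctx, "tcp://192.168.2.10:5556", 2);  // bit 1 -> I/O thread 1

        std::thread ta(drain, sub_a);                   // one worker thread per stream,
        std::thread tb(drain, sub_b);                   // normally pinned to its own core
        ta.join();
        tb.join();

        zmq_close(sub_a);
        zmq_close(sub_b);
        zmq_ctx_destroy(ctx);
        return 0;
    }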

Related

What is the relationship between request content size and request duration

At the company where I work, all our APIs send and expect requests/responses that follow the JSON:API standard, making the structure of the request/response content very regular.
Because of this regularity and the fact that we can have hundreds or thousands of records in one request, I think it would be fairly doable and worthwhile to start supporting compressed requests (every record would be something like < 50% of the size of its JSON:API counterpart).
To make a well informed judgement about the viability of this actually being worthwhile, I would have to know more about the relationship between request size and duration, but I cannot find any good resources on this. Anybody care to share their expertise/resources?
Bonus 1: If you were to have request performance issues, would you look at compression as a solution first, second, last?
Bonus 2: How does transmission overhead scale with size? (If I cut the size by 50%, by what percentage will the transmission overhead be cut?)
Request and response compression adds a time and CPU penalty on both the sender's side and the receiver's side. The savings in time are in the transmission.
Weighing the tradeoff depends a lot on the consumers of the API: when they make requests, how much they request, what is requested, where they are located, the type of device/OS and its capabilities, etc.
If the data is static (e.g. a REST query apihost/resource/idxx returning a static resource), there are standard web approaches, like caching of static resources, that clients/proxies will be able to assist with.
If the data is dynamic -- there are architectural patterns that could be used.
If the data is huge (e.g. big scientific data sets, video, etc.), you will almost always find it being served statically, with a metadata service providing the dynamic layer. For example, MPEG-DASH or HLS is just a collection of files.
I would choose compression as a last option relative to the other architectural options.
There are also implementation optimizations that would precede using compression of requests/responses. For example:
Are your services using all the resources at their disposal (cores, memory, I/O)?
Does the architecture allow scale-up and scale-out, and can the problem be handled effectively that way (remember the penalties on the client side due to compression)?
Can you use queueing, caching or other mechanisms to make things appear faster?
If you have explored all of these, your system is otherwise optimal, and you are down to the most granular unit of service where data volume is an issue, then by all means go after compression. Keep in mind that you also need to budget compute resources for compression on the server side (for a fixed workload).
Your question #2 on transmission overhead vs. size is really a question about bandwidth and latency. Bandwidth determines how much you can push through the pipe. Latency governs the perceived response times. Whether the payload is 10 bytes or 10 MB, latency for a client across the world encountering multiple hops will be larger than for a client encountering only one or two hops, and it is bound by the round-trip time. So a solution may be to distribute the servers and place them closer to your clients across the world rather than compressing data. That is another reason why compression isn't the first thing to look at.
Baseline your performance and benchmark your experiments for a representative user base.
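To make that bandwidth/latency split concrete for Bonus 2: a rough model is transfer time ≈ round-trip latency + payload size / bandwidth, so halving the payload only halves the second term. The numbers in this little C++ sketch are invented purely for illustration.

    // Back-of-envelope: transfer time ~= round-trip latency + payload / bandwidth.
    // All numbers below are made up for illustration only.
    #include <cstdio>

    int main() {
        const double rtt_s      = 0.100;      // 100 ms round trip (fixed per request)
        const double bandwidth  = 10e6 / 8;   // 10 Mbit/s link, in bytes per second
        const double full_bytes = 1e6;        // 1 MB uncompressed payload
        const double half_bytes = 0.5e6;      // same payload compressed to 50%

        double t_full = rtt_s + full_bytes / bandwidth;   // ~0.90 s
        double t_half = rtt_s + half_bytes / bandwidth;   // ~0.50 s
        std::printf("full: %.2fs  compressed: %.2fs  saved: %.0f%%\n",
                    t_full, t_half, 100.0 * (1.0 - t_half / t_full));
        return 0;
    }

With these made-up numbers a 50% smaller payload cuts the total time by only about 44%, and the smaller the payload is relative to the latency term, the smaller the saving becomes.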
I think what you are weighing here is the speed of your processor/CPU vs. the speed of your network connection.
A network connection can be impacted by things like distance, signal strength, DNS provider, etc., whereas your computer hardware is only limited by how much power you've put into it.
I'd wager that compressing your data before sending would result in shorter response times, yes, but it's probably going to be a very small amount. If you are sending JSON, text usually isn't all that large to begin with, so you would probably only see a change in performance at the millisecond level.
If that's what you are looking for, I'd go ahead and implement it, set some timing before and after, and check your results.
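As a starting point for that timing experiment, here is a minimal C++ sketch using zlib's compress2 (link with -lz). Nothing in it is specific to JSON:API; the payload is a placeholder string, and real measurements should use representative response bodies.

    // Measure how long zlib compression takes and how much it shrinks a payload.
    #include <zlib.h>
    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        // Placeholder payload; use a real JSON:API response body in practice.
        std::string json(500000, 'x');

        std::vector<unsigned char> out(compressBound(json.size()));
        uLongf out_len = out.size();

        auto t0 = std::chrono::steady_clock::now();
        int rc = compress2(out.data(), &out_len,
                           reinterpret_cast<const Bytef *>(json.data()), json.size(),
                           Z_BEST_SPEED);
        auto t1 = std::chrono::steady_clock::now();

        if (rc == Z_OK) {
            double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
            std::printf("%zu -> %lu bytes (%.0f%% of original) in %.2f ms\n",
                        json.size(), static_cast<unsigned long>(out_len),
                        100.0 * out_len / json.size(), ms);
        }
        return 0;
    }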

Calculate optimal packet size (optimal payload) to maximize network speed

I'm streaming tons of bytes from one PC to a web application (JavaScript) running on another machine (PC/mobile) through a WebSocket.
To my knowledge, the data needs to be split into packets to maximize network performance. My question is:
how can I obtain the payload size (optimal size) in real time to get the maximum speed?
is this different if I want to use WebRTC?
You should send as many bytes at once as you can. This not only optimizes sending on the network but also the software stack on the sending machine. As for the packet size, you'll find that about 1300-1400 bytes is a good size on most systems and networks, because it's a bit less than one "MTU".
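One way to follow that advice is to coalesce many small application messages into a single buffer and hand the whole thing to the socket in one call, letting TCP do the segmentation. Below is a minimal C++ sketch over a plain POSIX TCP socket; the WebSocket framing layer and proper error handling are left out, and send_batched is just an illustrative name.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <string>
    #include <vector>

    // Coalesce small messages and send them with as few calls as possible.
    void send_batched(int fd, const std::vector<std::string> &messages) {
        std::string buffer;
        for (const auto &m : messages)
            buffer += m;                        // application-level framing would go here

        size_t sent = 0;
        while (sent < buffer.size()) {          // send() may write less than asked
            ssize_t n = send(fd, buffer.data() + sent, buffer.size() - sent, 0);
            if (n <= 0) break;                  // real code would handle EINTR/EAGAIN
            sent += static_cast<size_t>(n);
        }
    }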
To my knowledge, the data needs to be split into packets to maximize network performance
There is indeed, but the TCP layer does that for you. The best you can do is provide TCP with as much data as possible, as quickly as possible, so it has the maximum choice.
my question is: 1. how can I obtain the payload size (optimal size) in real time
You can't.
to get the maximum speed?
You can't.
is this different if I want to use WebRTC?
No.

How to utilize all available bandwidth with real-time data?

How to measure actual bandwidth between server and client to decide how much of real-time data to send?
My server sends real-time data to clients, 30 times per second. If the server has too much data, it prioritises data chunks and throws away anything that doesn't fit into the available bandwidth, because this data will be invalidated next tick anyway. Data is sent over reliable (20%) and unreliable (80%) channels (both UDP-based, but if TCP as the reliable channel can provide any benefit, please let me know). The data is highly latency-sensitive. The server often (but not always!) has more data than available bandwidth. It's critical to send as much data as possible, but not more than the available bandwidth, to avoid packet drops or higher latency.
The server and client are custom applications, so I can implement any algorithm/protocol.
My main problem is how to keep track of available bandwidth. Also any statistical info about typical bandwidth jitter would be helpful (servers are in a cloud, clients are home users, worldwide).
At the moment I'm thinking about how to utilize:
Latency info from the reliable channel. It should correlate with bandwidth, because if latency grows this can (!) mean retransmission is involved as a result of packet drops, so the server must lower its data rate.
The amount of data received by the client on the unreliable channel during a time frame, especially if it is lower than what the server sent.
If the current latency is close to or below the lowest recorded one, bandwidth can be increased.
The problem is that this approach is quite complicated and involves a lot of "heuristics", like what the step to increase/decrease bandwidth should be, etc.
Looking for any advice from people who have dealt with a similar problem in the past, or just any bright ideas.
The first symptom of trying to use more bandwidth than you actually have will be increased latency, as you fill up the buffers between the sender and whatever the bottleneck is. See https://en.wikipedia.org/wiki/Bufferbloat. My guess is that if you can successfully detect increased latency as you start to fill up the bandwidth and back off then you can avoid packet loss.
I wouldn't underestimate TCP - people have spent a lot of time tuning its congestion avoidance to get a reasonable amount of the available bandwidth while still being a good network citizen. It may not be easy to do better.
On the other hand, a lot will depend on the attitude of the intermediate nodes, which may treat UDP differently from TCP. You may find that under load they either prioritize or discard UDP. Also some networks, especially with satellite links, may use https://en.wikipedia.org/wiki/TCP_acceleration without you even knowing about it. (This was a painful surprise for us - we relied on the TCP connection failing and keep-alive to detect loss of connectivity. Unfortunately the TCP accelerator in use maintained a connection to us, pretending to be the far end, even when connectivity to the far end had in fact been lost).
After some research, the problem has a name: Congestion Control, or Congestion Avoidance Algorithm. It's quite a complicated topic and there is a lot of material about it. TCP congestion control has evolved over time and is a really good one. There are other protocols that implement it too, e.g. UDT or SCTP.
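To make the latency-probing heuristics from the question a little more concrete, here is a very small C++ sketch in the spirit of TCP's additive-increase/multiplicative-decrease: grow the per-tick send budget while the RTT stays near the best value seen so far, and cut it when the RTT climbs. The class name, thresholds and step sizes are invented for illustration only.

    #include <algorithm>
    #include <cstdint>

    // Latency-driven send budget (AIMD-style), as in the question's heuristics.
    class RateController {
    public:
        // Call once per tick with the most recent RTT sample (milliseconds).
        void on_rtt_sample(double rtt_ms) {
            best_rtt_ms_ = std::min(best_rtt_ms_, rtt_ms);
            if (rtt_ms > best_rtt_ms_ * 1.5) {
                budget_bytes_ /= 2;                          // multiplicative decrease
                if (budget_bytes_ < kMinBudget) budget_bytes_ = kMinBudget;
            } else {
                budget_bytes_ += kIncrement;                 // additive increase
            }
        }

        // Bytes the server may send this tick; lower-priority chunks beyond
        // this budget get dropped, as described in the question.
        std::int64_t budget() const { return budget_bytes_; }

    private:
        static constexpr std::int64_t kMinBudget = 8 * 1024;   // 8 KiB per tick
        static constexpr std::int64_t kIncrement = 2 * 1024;   // +2 KiB per good tick
        double       best_rtt_ms_  = 1e9;
        std::int64_t budget_bytes_ = 64 * 1024;
    };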

Optimizing incoming UDP broadcast in Linux

Environment
Linux/RedHat
6 cores
Java 7/8
10G
Application
It's a low-latency, high-frequency trading application
Receives broadcast via multicast UDP
There are multiple data streams
Each incoming packet is less than 1K (fixed size)
Application latency is around 4 microseconds
Architecture
A separate application thread is mapped to each incoming multicast stream
Receives data from the socket using MulticastSocket.receive() as native bytes
Bytes are parsed and the order book is prepared
Problem
In spite of a tolerable application latency of around 4 microseconds, we are not able to achieve the desired performance. We believe it is because of network latency.
Tuning steps used
Increased the size of the following parameters:
netdev_max_backlog
NIC ring buffer receive size
rmem_max
tcp_mem
socketreceivebuffer (in the code)
Question:
We observed that the performance of the application deteriorated after we increased the values of the above-mentioned parameters. What are the suggested parameters to optimize, and what are the recommended values? A guide towards optimizing incoming broadcast would be appreciated.
Is there a way to measure the network latency more accurately in an environment like this? Note that the UDP sender is an external entity (an exchange).
Thanks in advance
It is not clear what you measure and how.
You mention that you are receiving UDP, so why are you tuning the TCP buffer size?
Generally, increasing incoming socket buffer sizes may help you with packet loss on a slow receiver, but it will not reduce latency.
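For reference, the "socketreceivebuffer (in the code)" item from the tuning list corresponds to SO_RCVBUF, which the kernel caps at net.core.rmem_max. Below is a minimal C++ sketch of a UDP multicast receive socket with an enlarged buffer; the group address, port and helper name are placeholders.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdint>

    // Open a UDP multicast receive socket with an enlarged SO_RCVBUF.
    int open_multicast_rx(const char *group, uint16_t port, int rcvbuf_bytes) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) return -1;

        // Capped by net.core.rmem_max unless SO_RCVBUFFORCE is used with CAP_NET_ADMIN.
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_bytes, sizeof rcvbuf_bytes);

        sockaddr_in addr{};
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(port);
        if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof addr) < 0) {
            close(fd);
            return -1;
        }

        ip_mreq mreq{};
        inet_pton(AF_INET, group, &mreq.imr_multiaddr);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);
        return fd;
    }

    // Usage (placeholder values): open_multicast_rx("239.1.1.1", 30001, 8 * 1024 * 1024);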
You may like to find out more about bufferbloat:
Bufferbloat is a phenomenon in packet-switched networks, in which excess buffering of packets causes high latency and packet delay variation (also known as jitter), as well as reducing the overall network throughput. When a router device is configured to use excessively large buffers, even very high-speed networks can become practically unusable for many interactive applications like voice calls, chat, and even web surfing.
You also use Java for a low-latency application. People normally fail to achieve this kind of latency with Java, one of the major reasons being the garbage collector. See Quantifying the Performance of Garbage Collection vs. Explicit Memory Management for more details:
Comparing runtime, space consumption, and virtual memory footprints over a range of benchmarks, we show that the runtime performance of the best-performing garbage collector is competitive with explicit memory management when given enough memory. In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management. However, garbage collection's performance degrades substantially when it must use smaller heaps. With three times as much memory, it runs 17% slower on average, and with twice as much memory, it runs 70% slower. Garbage collection also is more susceptible to paging when physical memory is scarce. In such conditions, all of the garbage collectors we examine here suffer order-of-magnitude performance penalties relative to explicit memory management.
People doing HFT using Java often turn off garbage collection completely and restart their systems daily.

Measuring link quality between two machines

Are there some standard methods (libraries) for measuring the quality of a link/connection between two computers?
These results would be used to improve routing logic: if the connection quality is unacceptable, stop data transfer to that computer and initiate an alternative route for that transfer. It looks like Skype has some of this functionality.
I was thinking of establishing several continuous test streams that can show bandwidth problems, and some kind of ping-pong messaging logic to show latency values.
Link Reliability
I usually use a continuous traceroute (i.e. mtr) for isolating unreliable links; but for your purposes, you could start with average ping statistics as #recursive mentioned. Migrate to more complicated things (like a UDP/TCP echo protocol) if you find that ICMP is getting blocked too often by client firewalls in the path.
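The "ping-pong messaging logic" mentioned in the question can be as small as a UDP echo round trip. Here is a minimal C++ sketch of the client side, assuming a cooperating echo service on the peer; the address and port are placeholders, and real code would add a receive timeout and average over many probes.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        sockaddr_in peer{};
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(7);                    // classic echo port (placeholder)
        inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);

        char probe[32] = "ping";
        char reply[32];

        auto t0 = std::chrono::steady_clock::now();
        sendto(fd, probe, sizeof probe, 0,
               reinterpret_cast<sockaddr *>(&peer), sizeof peer);
        recvfrom(fd, reply, sizeof reply, 0, nullptr, nullptr);  // blocks; add a timeout in real code
        auto t1 = std::chrono::steady_clock::now();

        double rtt_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("rtt: %.3f ms\n", rtt_ms);

        close(fd);
        return 0;
    }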
Bandwidth / Delay Estimation
For bandwidth and delay estimation, yaz provides a low-bandwidth algorithm to estimate throughput / delay along the path; it uses two different endpoints for measurement, so your client and servers will need to coordinate their usage.
Sally Floyd maintains a pretty good list of bandwidth estimation tools that you may want to check out if yaz isn't what you are looking for.
Ping is good for testing latency, but not bandwidth.
