Reducing the CPU impact of producers on RabbitMQ - performance

All,
I've been doing performance testing with RabbitMQ on a mid-grade desktop machine (5th-gen i3) and I've found that while RabbitMQ handles dozens of exchanges and even 100-200 queues & consumers fairly well, as soon as I increase the # of producers beyond ~30 the CPU usage very quickly goes to 100%. RAM usage is always acceptable (low hundreds of MB).
I'm sending messages with sizes ranging from 400 bytes to 40 kB, and interestingly this seems to have only a modest effect on CPU (it mostly affects RAM).
I've been testing message rates from my producers of between 1/second and 100/second, and this certainly has an effect on CPU, but not nearly as much as the # of producers does. For example, 10 producers at 10 msg/second is MUCH less of a load than 100 producers generating 1 msg/second.
When CPU usage hits the wall I don't see any other red or yellow flags, like hitting the max # of Erlang processes or the file descriptor limit, in the RabbitMQ admin console.
I'm currently using Python 3 + Pika 0.10 with a BlockingConnection and basic_publish, and no delivery mode specified for any of my producers (i.e. messages are not persistent). Is this highly asymmetric (and heavy) load from the producer side expected behavior?
Are there techniques I can use to reduce the load my producers put on my RabbitMQ instance?
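For reference, each producer boils down to roughly the following (a simplified sketch; the queue name, payload and rate are placeholders, not my real ones):

    import pika
    import time

    # One producer: Pika 0.10, BlockingConnection, plain basic_publish.
    # Queue name and message body are placeholders.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='readings')

    def publish(body):
        # No properties/delivery_mode set, so messages are transient (not durable).
        channel.basic_publish(exchange='', routing_key='readings', body=body)

    while True:
        publish(b'example payload')
        time.sleep(0.1)  # ~10 messages/second from this producer

The connection and channel are opened once and reused for every publish, rather than reconnecting per message.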
Thanks and Regards

Related

Kernel tuning for service terminating millions of TCP connections

I've just gotten a service up and running that at its peak needs to handle simultaneous TCP connections in the tens of millions. It's currently running without much tuning, simply by scaling out to a large number of hosts. The software itself is built on Netty and doesn't do much except translate data frames coming over WebSocket pipes into Kafka events.
My current goal is to pack as many connections onto a single machine as possible. I've currently settled on EC2 r6i.2xlarge instances, which have 8 CPUs and 64 GB of memory, and I'm looking for advice on kernel network-stack and Netty tuning.
Some stats on WebSocket traffic patterns:
Each client sends a WebSocket data frame about once per 10 seconds.
Data frames are less than 32KB in size and most are less than 4KB.
We can have sudden bursts of a few million connections in a matter of seconds (various competitions/events).
Many connections are quite short, and by far the most common pattern is a TCP accept, followed by a login data frame, followed by a connection close a few tens of seconds later.
From the above, the bitrate per TCP connection is less than 1 KB/s and the connections are mostly idle. However, on the backend side we push events to Kafka in batches, so there we have a much smaller number of sockets, each pushing a lot of data.
I've currently increased the file-descriptor ulimit and net.ipv4.tcp_max_orphans to about 10 million each, since I assumed both of these would be an issue.
Is anyone familiar with TCP/IP stack internals who could advise on the most important tunables to look into?
My own starting point would be to limit the amount of memory each socket uses and to increase the total memory available to the TCP/IP stack. However, the math here is not very clear from the docs, i.e. how the different flags relate to each other, since I don't know exactly how much memory a single TCP connection consumes inside the kernel.
Some concrete questions:
Which options should I use in Netty for these frontend WebSocket TCP connections, given the traffic patterns described?
How do I minimize the amount of memory used per socket in the kernel, and how do I calculate and set the kernel memory limits from there? (A rough back-of-envelope calculation is sketched below.)
Anything else worth looking into?
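To make the memory question concrete, this is the kind of back-of-envelope calculation I have in mind (all values are assumptions for illustration, not measurements; the fixed per-socket overhead in particular is a guess):

    # Rough kernel-memory estimate for a large number of mostly idle TCP sockets.
    # All numbers are assumptions for illustration, not measured values.
    connections = 10_000_000

    # Assumed per-socket cost: the minimum receive/send buffer targets
    # (the first fields of net.ipv4.tcp_rmem / tcp_wmem) plus a guessed
    # fixed overhead for the kernel's socket bookkeeping structures.
    rcv_min_bytes = 4096
    snd_min_bytes = 4096
    fixed_overhead = 2048

    per_socket = rcv_min_bytes + snd_min_bytes + fixed_overhead
    total = connections * per_socket
    print('~{:.1f} GiB for {:,} sockets'.format(total / 2**30, connections))

    # net.ipv4.tcp_mem is expressed in 4 KiB pages, so a ceiling that leaves
    # headroom on a 64 GB machine (say 48 GiB for TCP) would be roughly:
    print('tcp_mem max on the order of {:,} pages'.format(48 * 2**30 // 4096))

Even with these optimistic per-socket numbers the total comes out above the machine's 64 GB, which is exactly why I'd like to understand the real per-connection cost before setting tcp_mem and the rmem/wmem limits.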

Reaching limits of Apache Storm

We are trying to implement a web application with Apache Storm.
The application receives a huge load of ad requests (100 TPS, i.e. a hundred transactions per second), performs some simple calculations on them and then stores the result in a NoSQL database with a maximum latency of 10 ms.
We are using Cassandra as a sink for its writing capabilities.
However, we have already exceeded the 8 ms requirement; we are at 100 ms.
We tried to minimize the size of the buffers (the Disruptor buffers) and to balance the topology well using the parallelism of the bolts.
But we are still at 20 ms.
With 4 workers (8 cores / 16 GB) we are at 20k TPS, which is still very low.
Are there any suggestions for optimization, or are we just reaching the limits of Apache Storm (the limits of Java)?
I don't know the platform you're using, but in C++ 10 ms is an eternity. I would think you are using the wrong tools for the job.
Using C++, serving a local query should take under a microsecond.
Non-local queries that touch multiple memory locations and/or have to wait for disk or network I/O have no choice but to take more time. In this case parallelism is your best friend.
You have to find the bottleneck.
Is it I/O?
Is it CPU?
Is it memory bandwidth?
Is it memory access time?
Once you've found the bottleneck, you can either improve it, make it asynchronous, and/or multiply (= parallelize) it.
There's a trade-off between low latency and high throughput.
If you really need high throughput, you should rely on batching: adjust the buffer sizes upwards, or use Trident.
Avoiding transmitting tuples to other workers helps with low latency (localOrShuffleGrouping).
Please don't forget to monitor GC, which causes stop-the-world pauses. If you need low latency, these pauses should be minimized.

Optimizing incoming UDP broadcast in Linux

Environment
Linux/RedHat
6 cores
Java 7/8
10G
Application
It's a low-latency, high-frequency trading application
Receives broadcast via multicast UDP
There are multiple datastreams
Each incoming packet is less than 1 KB (fixed size)
Application latency is around 4 microsecond
Architecture
A separate application thread is mapped to each incoming multicast stream
Each thread receives data from its socket using MulticastSocket.receive(), as raw bytes
The bytes are parsed and the order book is built
Problem
In spite of a tolerable application latency of around 4 microseconds, we are not able to achieve the desired performance. We believe this is because of network latency.
Tuning steps used
Increased the size of the following parameters:
netdev_max_backlog
NIC ring buffer receive size
rmem_max
tcp_mem
socket receive buffer (in the code)
Question:
We observed that the performance of the application deteriorated after we increased the values of the above-mentioned parameters. Which parameters should be optimized, and what are the recommended values? Any guidance towards optimizing incoming broadcast traffic would be appreciated.
Is there a way to measure the network latency more accurately in an environment like this? Note that the UDP sender is an external entity (the exchange).
Thanks in advance
It is not clear what you are measuring, or how.
You mention that you are receiving UDP, so why are you tuning the TCP buffer sizes?
Generally, increasing incoming socket buffer sizes may help you with packet loss on a slow receiver, but it will not reduce latency.
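For a UDP receiver, the buffer that matters is the socket's own receive buffer (SO_RCVBUF), which is capped by net.core.rmem_max; tcp_mem only applies to TCP. A minimal illustration, shown in Python for brevity (your application would set the equivalent through the Java socket API, and the 4 MiB value is just an example):

    import socket

    # Minimal illustration of sizing a UDP receive buffer; the 4 MiB value is
    # only an example. Unprivileged processes are capped at net.core.rmem_max.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)

    # Linux doubles the requested value to account for bookkeeping overhead;
    # reading the option back shows what was actually granted.
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print('requested 4 MiB, kernel granted {} bytes'.format(granted))

A bigger buffer only hides bursts from a slow consumer; it does nothing for the latency of an individual packet.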
You may like to find out more about bufferbloat:
Bufferbloat is a phenomenon in packet-switched networks, in which excess buffering of packets causes high latency and packet delay variation (also known as jitter), as well as reducing the overall network throughput. When a router device is configured to use excessively large buffers, even very high-speed networks can become practically unusable for many interactive applications like voice calls, chat, and even web surfing.
You also use Java for a low-latency application. People normally fail to achieve this kind of latency with Java, one of the major reasons being the garbage collector. See Quantifying the Performance of Garbage Collection vs. Explicit Memory Management for more details:
Comparing runtime, space consumption, and virtual memory footprints over a range of benchmarks, we show that the runtime performance of the best-performing garbage collector is competitive with explicit memory management when given enough memory. In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management. However, garbage collection's performance degrades substantially when it must use smaller heaps. With three times as much memory, it runs 17% slower on average, and with twice as much memory, it runs 70% slower. Garbage collection also is more susceptible to paging when physical memory is scarce. In such conditions, all of the garbage collectors we examine here suffer order-of-magnitude performance penalties relative to explicit memory management.
People doing HFT using Java often turn off garbage collection completely and restart their systems daily.

ZMQ throughput optimization

I developed an application with widely varying ZMQ message sizes. On average they are ~177 bytes, but in reality most messages are very small (< 20 bytes) and just a few messages are very big (> 3000 bytes).
Now the network is the limiting factor (1 Gbit Ethernet). I can reach ~50 MByte/s. Another benchmark told me that the network throughput can reach ~85 MByte/s with a packet size of > 256 bytes.
I think my results are that low because most packets are very small. Am I right? Is there a possibility to optimize zmq so that my application can also use the whole bandwidth? Extended batching, for example?
Regards
The ZeroMQ guide illustrates the Black Box Pattern for high-speed subscribers. In essence, it uses a two-stream approach (per node), where each stream has its own I/O thread and subscriber, both of which are bound to a specific network interface (NIC) and core, so you'll need two network adapters and multiple cores per node for this to work. You can read the full details in the guide.
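On the batching question: ZeroMQ does some batching of small messages internally, but with mostly tiny payloads it can also help to coalesce them at the application level before sending. A rough sketch of the idea using pyzmq (the socket type, endpoint and flush threshold are arbitrary examples, not from the question):

    import zmq

    # Coalesce many small messages into one multipart send.
    # Socket type, endpoint and threshold are arbitrary examples.
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.connect('tcp://collector:5555')

    BATCH_LIMIT = 16 * 1024  # flush once ~16 KB of payload has accumulated
    batch, batch_bytes = [], 0

    def send(msg):
        global batch_bytes
        batch.append(msg)
        batch_bytes += len(msg)
        if batch_bytes >= BATCH_LIMIT:
            flush()

    def flush():
        global batch, batch_bytes
        if batch:
            push.send_multipart(batch)  # one logical message, many small frames
            batch, batch_bytes = [], 0

The receiver unpacks the parts with recv_multipart(). The trade-off is extra latency while a batch fills, so in practice you would also flush on a timer.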

Does sending large numbers of APNS (push notifications) require particularly high bandwidth or RAM?

Hopefully this is a fairly simple question, although I haven't found a straightforward answer anywhere yet.
We will be releasing our app shortly; push messages are all tested and working. However, we have only tested on a smaller scale. All messages to be sent are stored on our VPS, then once per minute they are all sent out at once, and then the table is truncated. So they are not going out continuously, but in batches.
I presume that APNS itself can handle hundreds of thousands of messages at once, but would our server be capable of sending out 10k or 100k if the app was successful?
The only info I have to hand is this:
Traffic: 300 GB
VPS CPU upper limit in MHz: Unlimited
VPS CPUs: 8 units
VPS RAM upper limit: 512 MB
However none of the people working on the app have much direct experience with servers, so we don't know if it would bottleneck or not.
Thanks in advance everyone.
This heavily depends on the program responsible for sending those messages. The messages themselves will be quite small, and presumably they are loaded row by row from the table and not stored anywhere afterwards. In that case you won't use a lot of RAM. However, if the program loads all of them at once and is written in, for example, PHP, you may well have a problem with RAM usage.
Whether you stay inside the traffic limit is easy to calculate: it is the number of expected messages times the average size of a message.
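For example, with assumed numbers (your real payload sizes and volumes will differ):

    # Back-of-envelope traffic estimate; every number here is an assumption.
    messages_per_batch = 100_000      # the worst case mentioned in the question
    payload_bytes = 1_024             # assumed size of one notification payload
    batches_per_day = 60 * 24         # one batch per minute, around the clock

    daily_bytes = messages_per_batch * payload_bytes * batches_per_day
    print('~{:.0f} GB/day'.format(daily_bytes / 1e9))   # ~147 GB/day

At that (extreme) rate a 300 GB allowance is used up in a couple of days, so it matters a lot whether 100k per minute is an occasional peak or a sustained average, and over what period the 300 GB is metered.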
CPU will most likely not be a problem since you really don't process anything.
A problem that has not been mentioned yet is the number of open connections. Depending on the frequency of updates and how (or whether) you keep connections alive between updates, with 100k users you will probably not be able to manage with only one server, simply because keeping that many connections open is not practicable.

Resources