Optimizing incoming UDP broadcast in Linux - performance

Environment
Linux/RedHat
6 cores
Java 7/8
10G
Application
It's a low-latency, high-frequency trading application
Receives broadcast via multicast UDP
There are multiple datastreams
Each incoming packet is less than 1 KB (fixed size)
Application latency is around 4 microseconds
Architecture
Separate application thread is mapped to each incoming multicast stream
Receives data from the socket via MulticastSocket.receive() into native
bytes
Bytes are parsed and the order book is prepared
Problem
Despite a tolerable application latency of around 4 microseconds, we are not able to achieve the desired performance. We believe this is because of network latency.
Tuning steps used
Increased the size of the following parameters:
netdev_max_backlog
NIC ring buffer receive size
rmem_max
tcp_mem
socket receive buffer, set in the code (see the sketch below)
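
For reference, a minimal sketch of the in-code buffer setting, assuming a plain java.net.MulticastSocket; the port and group address are placeholders, and the size the kernel actually grants is capped by rmem_max:

    import java.net.InetAddress;
    import java.net.MulticastSocket;

    public class RcvBufCheck {
        public static void main(String[] args) throws Exception {
            // Request a larger SO_RCVBUF, then ask the kernel what it actually granted.
            MulticastSocket socket = new MulticastSocket(30001);           // placeholder port
            socket.setReceiveBufferSize(8 * 1024 * 1024);                  // request 8 MB
            socket.joinGroup(InetAddress.getByName("239.1.1.1"));          // placeholder group
            System.out.println("granted SO_RCVBUF = " + socket.getReceiveBufferSize());
            socket.close();
        }
    }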
Question:
We observed that the performance of the application deteriorated after we increased the values of the above parameters. Which parameters should be tuned, and what are the recommended values? A guide to optimizing incoming multicast traffic would be appreciated.
Is there a way to measure the network latency more accurately in an environment like this? Note that the UDP sender is an external entity (an exchange).
Thanks in advance

It is not clear what and how you measure.
You mention that you are receiving UDP, why are you tuning TCP buffer size?
Generally, increasing incoming socket buffer sizes may help you with packet loss on a slow receiver, but it will not reduce latency.
You may like to find out more about bufferbloat:
Bufferbloat is a phenomenon in packet-switched networks, in which excess buffering of packets causes high latency and packet delay variation (also known as jitter), as well as reducing the overall network throughput. When a router device is configured to use excessively large buffers, even very high-speed networks can become practically unusable for many interactive applications like voice calls, chat, and even web surfing.
You also use Java for a low-latency application. People normally fail to achieve this kind of latency with Java, one of the major reasons being the garbage collector. See Quantifying the Performance
of Garbage Collection vs. Explicit Memory Management for more details:
Comparing runtime, space consumption, and virtual memory footprints
over a range of benchmarks, we show that the runtime performance
of the best-performing garbage collector is competitive with explicit
memory management when given enough memory. In particular,
when garbage collection has five times as much memory
as required, its runtime performance matches or slightly exceeds
that of explicit memory management. However, garbage collection’s
performance degrades substantially when it must use smaller
heaps. With three times as much memory, it runs 17% slower on
average, and with twice as much memory, it runs 70% slower. Garbage
collection also is more susceptible to paging when physical
memory is scarce. In such conditions, all of the garbage collectors
we examine here suffer order-of-magnitude performance penalties
relative to explicit memory management.
People doing HFT using Java often turn off garbage collection completely and restart their systems daily.
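Short of disabling collection outright, a common mitigation is to keep the receive path allocation-free so the collector has nothing to do during trading hours. A minimal sketch, assuming the same MulticastSocket setup as above; the port, group and parse() method are placeholders, not the asker's code:

    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;

    public class AllocationFreeReceiver {
        public static void main(String[] args) throws Exception {
            MulticastSocket socket = new MulticastSocket(30001);            // placeholder port
            socket.joinGroup(InetAddress.getByName("239.1.1.1"));           // placeholder group

            // Allocate once, outside the hot loop: packets are < 1K, so a fixed
            // 1 KB buffer is reused on every receive and no garbage is created.
            byte[] buf = new byte[1024];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            while (true) {
                socket.receive(packet);             // fills buf in place
                parse(buf, packet.getLength());     // parse without copying
                packet.setLength(buf.length);       // reset length before the next receive
            }
        }

        private static void parse(byte[] data, int len) {
            // placeholder for the order-book parsing step
        }
    }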

Related

Will memory that is physically closer to the CPU perform faster than memory physically farther away?

I know this may sound like a silly question considering the speeds at which computers work, but say a certain address in RAM is physically closer to the CPU on the motherboard, compared to a memory address located as far from the CPU as possible: will this have any effect on how quickly the closer memory address is accessed compared to the farthest one?
If you're talking about NUMA, accessing RAM connected to this socket vs. going over the interconnect to access RAM connected to another socket, then yes, this is a well-known effect. Otherwise, no.
Also note that signal travel time over the external memory bus is only a tiny fraction of the total cache-miss latency cost for a CPU core. Queuing inside the CPU, time to check the L3 cache, and the internal bus between cores and memory controllers all add up. Tightening DDR4 CAS latency by one whole memory cycle makes only a small (but measurable) difference to overall memory performance (see hardware review sites benchmarking memory overclocking); other timings matter even less.
No, DDR4 (and earlier) memory buses are synced to a clock and expect a response a specific number of memory-clock cycles[1] after a command (so the controller can pipeline requests without causing overlap). See What Every Programmer Should Know About Memory? for more about DDR memory commands and memory timings (and CAS latency vs. other timings).
(Wikipedia's introduction to SDRAM mentions that earlier DRAM standards were asynchronous, so yes they maybe could just reply as soon as they had the data ready. If that happened to be a whole clock cycle early, a speedup was perhaps possible.)
So memory latency is discrete, not continuous, and being 1 mm closer can't make it fractions of a nanosecond faster. The only plausible effect is if you populate the DIMM slots in a way that lets you run tighter timings and/or a faster memory clock than some other arrangement would. Go read about memory overclocking if you want real-world experience from people who push systems to the limits of stability. What's best may depend on the motherboard; the physical length of traces isn't the only consideration.
AFAIK, all real-world motherboard firmwares insist on using the same timings for all DIMMs on all memory channels[2].
So even if one DIMM could theoretically support tighter timings than another (e.g. because of shorter or less noisy traces, or less signal reflection because it sits at the end of a trace instead of the middle), you couldn't actually configure a system to make that happen. Physical proximity isn't the only thing that could help.
(This is probably a good thing; interleaving physical address space across multiple DRAM channels allows sequential reads/writes to benefit from the aggregate bandwidth of all channels. But if they ran at different speeds, you might have more contention for shared busses between controllers and cores, and more time left unused.)
Memory frequency and timings are usually chosen by the firmware after reading the SPD ROM on each DIMM (memory module) to find out what memory is installed and what timings each DIMM is rated for at what frequencies.
Footnote 1: I'm not sure how transmission-line propagation delays over memory traces are accounted for when the memory controller and DIMM agree on how many cycles there should be after a read command before the DIMM starts putting data on the bus.
The CAS latency is a timing number that the memory controller programs into the "mode register" of each DIMM.
Presumably the number the DIMM sees is the actual number it uses, and the memory controller has to account for the round-trip propagation delay to know when to really expect a read burst to start arriving. Other command latencies are just times between sending different commands so propagation delay doesn't matter: the gap at the sending side equals the gap at the receiving side.
But the CAS latency seen by the memory controller includes the round-trip propagation delay for signals to go over the wires to the DIMM and back. Modern systems with DDR4-4000 have a clock that runs at 2GHz, cycle time of half a nanosecond (and transferring data on the rising and falling edge).
At light speed, 0.5 ns is "only" about 15 cm, half of one of Grace Hopper's nanoseconds, and with transmission-line effects the usable distance could be somewhat shorter (maybe 2/3 of that). On a big server motherboard it's certainly plausible that some DIMMs are far enough from the CPU for traces to be that long.
The rated speeds on memory DIMMs are somewhat conservative, so they're still supposed to work at that speed even when placed as far away as the DDR4 standards allow. I don't know the details, but I assume JEDEC considers this when developing DDR SDRAM standards.
If there's a "data valid" pin the DIMM asserts at the start of the read burst, that would solve the problem, but I haven't seen a mention of that on Wikipedia.
Timings are those numbers like 9-9-9-24, the first one being CAS latency (CL). https://www.hardwaresecrets.com/understanding-ram-timings/ was an early Google hit if you want to read more from a perf-tuning PoV. They are also described in Ulrich Drepper's "What Every Programmer Should Know About Memory", linked earlier, from a how-it-works PoV. Note that the higher the memory clock speed, the less real time (in nanoseconds) a given number of cycles is, so CAS latency and other timings have stayed nearly constant in nanoseconds, or even dropped, as clock frequencies have increased. https://www.crucial.com/articles/about-memory/difference-between-speed-and-latency shows a table.
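To make the cycles-to-nanoseconds point concrete (the DIMM grades below are just illustrative examples):

    t_{CAS} = \frac{CL}{f_{mem}} \quad\Rightarrow\quad
    \text{DDR3-1600, CL9: } \frac{9}{800\,\mathrm{MHz}} = 11.25\,\mathrm{ns}, \qquad
    \text{DDR4-3200, CL16: } \frac{16}{1600\,\mathrm{MHz}} = 10\,\mathrm{ns}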
Footnote 2: Unless we're talking about special faster memory for use as a scratchpad or cache for the larger main memory, but still off-chip. e.g. the 16GB of MCDRAM on Xeon Phi cards, separate from the 384 GB of regular DDR4. But faster memories are usually soldered down so timings are fixed, not socketed DIMMs. So I think it's fair to say that all DIMMs in a system will run with the same timings.
Other random notes:
https://www.overclock.net/threads/ram-4x-sr-or-2x-dr-for-ryzen-3000.1729606/ contained some discussion of motherboards with a "T-topology" vs. "daisy chain" for the layout of their DIMM sockets. This seems pretty self-explanatory terminology: a T would be when each of the 2 DIMMs on a channel are on opposite sides of the CPU, about equidistant from the pins. vs. "daisy chain" when both DIMMs for the same channel are on the same side of the CPU, with one farther away than the other.
I'm not sure what the recommended practice is for using the closer or farther socket. Signal reflection could be more of a concern with the near socket because it's not the end of the trace.
If you have multiple DIMMs on the same memory channel, the DDR4 protocol may require them all to run at the same timings. (Such DIMMs see each other's commands; there's a "chip-select" pin that the memory controller can drive independently for each DIMM to control which one a command is for.)
But in theory a CPU could be designed to run its different memory channels at different frequencies, or at least different timings at the same frequency if the memory controllers all share a clock. And of course in a multi-socket system, you'd expect no physical / electrical obstacle to programming different timings for the different sockets.
(I haven't played around in the BIOS on a multi-socket system for years, not since I was a cluster sysadmin in AMD K8 / K10 days.) So IDK, it's possible that some BIOS might have options to set different timings for different sockets, or simply allow different auto-detection if you use slower RAM in one socket than in the others. But given the price of servers and how few people run them as hobby machines, it's unlikely that vendors would bother to support or validate such a config.

Is coalesced memory access a feature or phenomenon?

I'm currently writing a smaller project in OpenCL, and I'm trying to find out what really causes memory coalescing. Every book on GPGPU programming says it's how GPGPUs should be programmed, but not why the hardware would prefer it.
So is it some special hardware component which merges data transfers? Or is it simply to better utilize the cache? Or is it something completely different?
Memory coalescing makes several different things more efficient. It is usually done before the requests hit the cache. Similar to the SIMT execution model, it is an architectural trade-off: it enables GPUs to have a more efficient and very high-performance memory system, but it also forces programmers to think carefully about their data layout.
Without coalescing, either the cache needs to be able to serve a huge number of requests at the same time, or memory access would take a lot longer as the different data transfers would need to be handled one at a time. This is relevant even when just checking whether something is a hit or a miss.
Merging requests is rather easy to do: you just pick one transfer and then merge all requests with matching upper address bits. You generate a single request per cycle and replay the load or store instruction until all threads have been handled.
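As a rough model of that merging step (the warp size of 32 and the 128-byte segment are illustrative assumptions, not any particular GPU's parameters):

    import java.util.LinkedHashSet;
    import java.util.Set;

    public class CoalesceModel {
        // Group the addresses issued by one warp into unique memory transactions
        // by masking off the low (offset) bits, i.e. keeping the upper address bits.
        static Set<Long> coalesce(long[] threadAddresses, int segmentBytes) {
            Set<Long> transactions = new LinkedHashSet<>();
            for (long addr : threadAddresses) {
                transactions.add(addr & ~(long) (segmentBytes - 1));
            }
            return transactions;
        }

        public static void main(String[] args) {
            long[] coalesced = new long[32];
            long[] scattered = new long[32];
            for (int i = 0; i < 32; i++) {
                coalesced[i] = 0x1000 + 4L * i;      // consecutive 4-byte elements
                scattered[i] = 0x1000 + 4096L * i;   // one element per 4 KB stride
            }
            System.out.println(coalesce(coalesced, 128).size()); // 1 transaction
            System.out.println(coalesce(scattered, 128).size()); // 32 transactions
        }
    }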
Caches also store consecutive bytes (32/64/128-byte lines). This fits most applications well, is a good fit for modern DRAM, and reduces the overhead of cache bookkeeping information: the cache is organized in cache lines, and each cache line has a tag that indicates which addresses are stored in the line.
Modern DRAM uses wide interfaces and long bursts: the memory of a GPU is typically organized in 32-bit or 64-bit wide channels, with GDDR5 memory that has a burst length of 8. This means that every transaction at the DRAM interface has to fetch at least 32 bits × 8 = 32 bytes or 64 bits × 8 = 64 bytes at a time, even if just a single byte is required. Designing data layouts that lead to coalesced requests helps to use the DRAM interface efficiently.
GPUs also have a huge number of parallel threads active at the same time and rather small caches. CPUs are often able to use their caches to reorder their memory requests into DRAM-friendly patterns. The larger number of threads and smaller caches on GPUs make this "cache-based coalescing" less effective, as the data will often not stay in the cache long enough to get merged with other requests to the same cache line.
Despite the "random access" in the name "RAM" (random-access memory), DDR3 SDRAM is faster at accessing consecutive locations than random ones.
Case in point: "CAS Latency" is the amount of time that DDR3 RAM will stall when you're accessing a new "column", as your RAM chip is literally charging up to serve the new data from another location on the chip.
EDIT: Jan Lucas argues that RAS Latency is more important in practice. See his comment for details.
There's roughly a 10 ns delay whenever you switch columns. So if you keep your accesses 'close' to each other, you don't invoke a CAS delay.
So if you have 20 words to access at a particular location, it's more efficient to access those 20 words before moving to a new memory location (invoking a CAS delay). Otherwise, you'll have to invoke ANOTHER CAS delay to "switch back" between memory locations.
It's just around 10 nanoseconds, but that amount of time adds up quickly.
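A rough worked example, assuming the ~10 ns switch penalty above and 20 words at each of two locations (purely illustrative numbers):

    \text{grouped accesses: } 2\ \text{column switches} \times 10\,\mathrm{ns} = 20\,\mathrm{ns}\ \text{of switch penalty}
    \text{word-by-word interleaving: } 40\ \text{switches} \times 10\,\mathrm{ns} = 400\,\mathrm{ns}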

Reaching limits of Apache Storm

We are trying to implement a web application with Apache Storm.
The application receives a huge load of ad requests (100 TPS, a hundred transactions per second), makes some simple calculations on them and then stores the result in a NoSQL database with a maximum latency of 10 ms.
We are using Cassandra as a sink for its writing capabilities.
However, we have already exceeded the 8 ms requirement; we are at around 100 ms.
We tried to minimize the size of the buffers (the Disruptor buffers) and to balance the topology well, using the parallelism of the bolts.
But we are still at 20 ms.
With 4 workers (8 cores / 16 GB) we are at 20k TPS, which is still very low.
Are there any suggestions for optimization, or are we just reaching the limits of Apache Storm (the limits of Java)?
I don't know the platform you're using, but in C++ 10ms is eternity. I would think you are using the wrong tools for the job.
Using C++, serving some local query should take under a microsecond.
Non-local queries that touch multiple memory locations and/or have to wait for disk or network I/O have no choice but to take more time. In this case parallelism is your best friend.
You have to find the bottleneck.
Is it I/O?
Is it CPU?
Is it memory bandwidth?
Is it memory access time?
After you've found the bottleneck, you can either improve it, async it and/or multiply (=parallelize) it.
There's a trade-off between low latency and high throughput.
If you really need high throughput, you should rely on batching: adjust the buffer sizes upward, or use Trident.
Avoiding transmitting tuples to other workers helps latency (localOrShuffleGrouping); see the sketch after this answer.
Please don't forget to monitor the GC, which causes stop-the-world pauses. If you need low latency, it should be minimized.
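A minimal sketch of the localOrShuffleGrouping wiring, assuming Storm's TopologyBuilder API (org.apache.storm in Storm 1.x+); AdRequestSpout, CalcBolt and CassandraWriterBolt are placeholders for the application's own components, not real classes:

    import org.apache.storm.topology.TopologyBuilder;

    public class AdTopology {
        public static TopologyBuilder build() {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("ad-spout", new AdRequestSpout(), 4);           // placeholder spout
            builder.setBolt("calc", new CalcBolt(), 8)                       // placeholder bolt
                   .localOrShuffleGrouping("ad-spout");                      // prefer same-worker transfer
            builder.setBolt("cassandra-sink", new CassandraWriterBolt(), 8)  // placeholder sink bolt
                   .localOrShuffleGrouping("calc");
            return builder;
        }
    }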

Estimating how processor frequency affects I/O performance

I am doing research about dedicated I/O software that would run on consumer hardware. Essentially it boils down to saving huge data streams for later processing. Right now I am looking for a model to estimate performance factors on x86.
Take for example the new Macbook Pro:
high-speed Thunderbolt I/O (input/output) technology delivers
an amazing 10 gigabits per second of transfer speeds in both
directions
1.25 GB/s sounds nice, but most processors of the day are clocked around 2 GHz. Multiple cores make little difference as long as only one can be assigned per network channel.
So even if the software acts as a miniature operating system and limits itself to network/disk operations, the amount of data flowing to storage can't be greater than P / (2 × N)[1] chunks per second. Although this hints at the rough performance limit, I feel it's far from adequate.
What other considerations should one take estimating I/O performance in regards to processor frequency and other hardware specifics? For simplicity's sake, assume here that storage performs instantly under all circumstances.
[1] P - processor frequency; N - algorithm overhead
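Purely for illustration, plugging rough numbers into that model, with P = 2 GHz and an assumed N of 1000 cycles of overhead per 1 KB chunk (the value of N is arbitrary here):

    \frac{P}{2N} = \frac{2\times10^{9}\,\mathrm{Hz}}{2\times10^{3}} = 10^{6}\ \text{chunks/s} \approx 1\,\mathrm{GB/s}\ \text{at 1 KB per chunk}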
The hardware limiting factors are probably the I/O bus performance, say PCIe, and more recently, the FSB clock-rates, since memory controllers are moving from northbridge to the CPUs themselves.
Then, of course, you have to figure out what sort of processing you need to do on the input, and how much work it is to produce the output. These, at least for conventional software running on a CPU, are dependent on the processor clock, but not only. Writing your code to take advantage of the hardware facilities like caches, instruction-level parallelism, etc. is still a black art but can give you an order of magnitude performance boost.
Basically what I'm ranting about is that not all software is created equal, and you probably want to take that into account.
Most likely, hard disk controllers will decide hard disk I/O performance, graphics cards will decide maximum resolution and refresh performance, and so on. I don't really understand the question; the CPU has been becoming less and less involved in these kinds of things for the last 10 years or so.
I doubt the question even has bearing on CPUs with integrated GPUs, since the buffer to be output to the screen is in external memory, sharing a bus with (again) a controller on the motherboard.
It's all buffered, so I can only see CPUs affecting file performance if you somehow force the hardware buffer size to something insanely puny. Edit: and I'm pretty sure Apple will prevent you from doing such things. ;)
For Thunderbolt specifically, it's more about what the minimum CPU model is that supports the kinds of bus speeds required by the Thunderbolt chipset version in the machine in question.
Thunderbolt is a raw data traffic system, and its performance specs are potential maximums, hence all the asterisks in the Apple specs. I believe it will indeed alleviate bottlenecks and in general give lag-free, intelligent data shuffling while doing many things simultaneously.
The CPU will idle-wait a shorter time for needed data, but the processing speed of the data is the same. When playing or creating a movie, the codec processing time will be the same, but you will still feel a boost/lack of lag because the data is there when the CPU needs it. For the I/O, the bottleneck will become the read/write speed of your hard disk instead, and the CPU bottleneck (for file copy operations, likely at least some code in Finder) will stay the same.
In other words, only CPU-intensive tasks such as for example movie encoding will benefit significantly from a faster CPU, while the benefits of Thunderbolt vs. a mix of interfaces will boost machines with both slow and fast CPUs.

The same driver for multiple network cards -- performance bottleneck?

I'm using the e1000e driver for multiple Intel network cards (Intel EXPI9402PT, based on the 82571EB chip). The problem is that when I try to utilize the maximum speed (1 Gbit/s) on more than one interface, the speed on each interface starts to drop.
I have my own driver in kernel space designed just to send given packets. It allocates packets with:
skb = dev_alloc_skb(packet->len);
and then sends them with:
result = dev->hard_start_xmit(skb,dev);
Each interface has its own instance of the driver.
For one interface I get: 120435948 bytes/sec.
For two interfaces I get: 61080233 bytes/sec and 60515294 bytes/sec.
For three interfaces I get: 28564020 bytes/sec, 27111184 bytes/sec, 27118907 bytes/sec.
What can be the cause? Is the hard_start_xmit function reentrant?
This is most likely due to a lack of bandwidth over your motherboard.
If you're trying to pump 3 Gb/s of information through a bus slower than 3 Gb/s, you'll have problems. What sort of bus are these cards on?
There may be a fix, but I think this is a physical limitation of the board, not necessarily your driver.
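For scale (nominal numbers, before protocol overhead), three saturated gigabit interfaces need roughly

    3 \times 1\,\mathrm{Gbit/s} = 3\,\mathrm{Gbit/s} \approx 375\,\mathrm{MB/s}

so any shared link between the cards and memory that is slower than that will cap the aggregate; for comparison, a classic 32-bit/33 MHz PCI bus tops out around 133 MB/s.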
When I add the numbers together for 2 interfaces, the net result is slightly bigger than the output for a single interface. To me, this means the system is slightly more efficient when using both interfaces, possibly due to better CPU or bus utilization. But note that the result is only slightly better, and it probably indicates that the resource causing the bottleneck is limited to about 121 MB/s. Once the load (3 active interfaces) exceeds this limit, performance drops dramatically to 82 MB/s.
It is hard to pin down the exact cause without some additional measurements, but my guesses would be
CPU limited: adding more CPUs to the system would rule this out as a problem.
Memory limited: remember that even if the device is in a x4 or x8 slot, the connection to main memory (i.e. where the SKBs live) may not be able to sustain that load.
Interrupt limited: the packets per second might be high enough that switching in and out of interrupt context is hurting performance. This is less likely, as most drivers are good about interrupt coalescing, but if possible, switch the driver to a polled mode to rule this out.
