zeromq and liburing (io_uring)

Can anybody tell me the main difference between ZeroMQ and liburing when building a TCP server in C?
Both can handle asynchronous requests and work with sockets. ZeroMQ is older than liburing.

io_uring can do real asynchronous I/O; ZeroMQ instead uses a thread pool.
The mitigations introduced for Spectre, Meltdown, Retbleed and friends mean that context switching is much more expensive now, so thread-pool approaches become bottlenecked on the CPU at high load.
io_uring can sit at 20% CPU usage where a thread pool starves at 100%.
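For concreteness, here is a minimal liburing sketch of a single asynchronous read on an already-connected TCP socket (a sketch only: error handling is omitted and conn_fd is assumed to come from the usual socket()/bind()/listen()/accept() path):

    /* One async read via io_uring; link with -luring. */
    #include <liburing.h>

    int read_once(int conn_fd, char *buf, unsigned len)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);              /* small SQ/CQ is enough here */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, conn_fd, buf, len, 0); /* queue the read */
        io_uring_submit(&ring);                        /* hand it to the kernel */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);                /* block until it completes */
        int n = cqe->res;                              /* bytes read, or -errno */
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return n;
    }

The read itself needs no dedicated user-space thread; a real server would keep many such requests in flight on one ring instead of waiting on each one.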

Related

May I have Project Loom Clarified?

Brian Goetz got me excited about project Loom and, in order to fully appreciate it, I'll need some clarification on the status quo.
My understanding is as follows: currently, in order to have real parallelism, we need a thread per CPU core. 1) Is there then any point in having n+1 threads on an n-core machine? Project Loom will bring us virtually limitless threads/fibres by relying on the JVM to carry out a task on a virtual thread, inside the JVM. 2) Will that be truly parallel? 3) How, specifically, will that differ from the aforementioned scenario, "n+1 threads on an n-core machine"?
Thanks for your time.
Virtual threads allow for concurrency (IO-bound work), not parallelism (CPU-bound work). They represent causal simultaneity, but not resource-usage simultaneity.
In fact, if two virtual threads are in an IO-bound* state (awaiting a return from a REST call, for example), then no OS thread is being used at all. Normal threads (if not using a reactive or completable semantic) would both be blocked and unavailable for use until the calls complete.
*Except under certain conditions (e.g., use of synchronized vs ReentrantLock, blocking that occurs in a native method, and possibly some other minor cases).
is there then any point in having n+1 threads on an n-core machine?
For one, most modern n-core machines have n*2 hardware threads because each core has 2 hardware threads.
Sometimes it does make sense to spawn more OS threads than hardware threads, namely when some of the OS threads spend their time asleep waiting for something. For instance, on Linux, until io_uring arrived a couple of years ago, there was no good way to implement asynchronous I/O for files on local disks. Traditionally, disk-heavy applications spawned more threads than CPU cores and used blocking I/O.
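A sketch of that traditional pattern, for contrast (the file path and thread count are illustrative): spawn more OS threads than cores and let each one block in read():

    #include <pthread.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define NUM_WORKERS 64                 /* deliberately more than the core count */

    static void *disk_worker(void *arg)
    {
        const char *path = arg;
        char buf[4096];
        int fd = open(path, O_RDONLY);
        if (fd >= 0) {
            /* The thread simply sleeps in the kernel while each read is in flight. */
            while (read(fd, buf, sizeof buf) > 0)
                ;
            close(fd);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t workers[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_create(&workers[i], NULL, disk_worker, "/tmp/example.dat");
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_join(workers[i], NULL);
        return 0;
    }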
Will that be truly parallel?
Depends on the implementation: not just the language runtime, but also the I/O-related parts of the standard library. For instance, on Windows, when doing disk or network I/O in C# with async/await (an equivalent of Project Loom, released around 2012), these tasks are truly parallel; the OS kernel and drivers are indeed doing more work at the same time. AFAIK, on Linux async/await is only truly parallel for sockets but not for files; for asynchronous file I/O it uses a pool of OS threads under the hood.
How, specifically, will that differ from the aforementioned scenario "n+1 threads on an n-core machine "?
OS threads are more expensive for a few reasons. (1) They require a native stack, so each OS thread consumes memory. (2) Memory is slow and processors have caches to compensate, but switching between OS threads increases RAM traffic because thread-specific data goes cold in the caches after a context switch. (3) OS schedulers have been improving for decades, but they are still not free; for one thing, saving and restoring thread state to and from memory takes time.
The higher-level cooperative multitasking implemented in C# async/await or Java’s Loom causes way less overhead when switching contexts, compared to switching OS threads. At least in theory, this should improve both throughput and latency for I/O heavy applications.

Is it a good practice to set interrupt affinity and io handling thread affinity to the same core?

I am trying to understand irq affinity and its impact to system performance.
I went through why-interrupt-affinity-with-multiple-cores-is-not-such-a-good-thing, and learnt that NIC IRQ affinity should be set to a core other than the one handling the network data, so that the handling is not interrupted by incoming IRQs.
I doubt this: if we handle the data on a different core than the IRQ core, we will get more cache misses when retrieving the network data from the kernel. So I actually believe that setting the IRQ affinity to the same core as the thread handling the incoming data will improve performance, due to fewer cache misses.
I am trying to come up with some verification code, but before I present any results, am I missing something?
IRQ affinity is a double-edged sword. In my experience it can improve performance, but only in a very specific configuration with a pre-defined workload. As far as your question is concerned (considering only the RX path): typically, when a NIC interrupts one of the cores, in the majority of cases the interrupt handler does not do much except trigger a mechanism (bottom half, tasklet, kernel thread or networking-stack thread) to process the incoming packet in some other context. If the same packet-processing core is also handling the interrupts (even though the ISR does not do much), it is bound to lose some cache benefit due to context switches and may see more cache misses. How big the impact is depends on a variety of other factors.
In NIC drivers, typically, a core affinity is assigned to each RX queue [separating the processing of each RX queue across different cores], which provides more performance benefit.
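A hedged sketch of the two knobs being discussed: pinning the packet-processing thread to a core with pthread_setaffinity_np(), and steering a NIC RX-queue IRQ by writing a CPU mask to /proc/irq/<N>/smp_affinity (the IRQ number and core IDs below are purely illustrative, and writing the mask needs root):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void pin_current_thread(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void set_irq_affinity(int irq, unsigned cpu_mask)
    {
        char path[64];
        snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);
        FILE *f = fopen(path, "w");
        if (f) {
            fprintf(f, "%x\n", cpu_mask);   /* e.g. 0x4 selects CPU 2 */
            fclose(f);
        }
    }

    int main(void)
    {
        set_irq_affinity(125, 0x4);  /* hypothetical NIC RX-queue IRQ -> CPU 2 */
        pin_current_thread(2);       /* process packets on the same (or an adjacent) core */
        /* ... packet-processing loop ... */
        return 0;
    }

Whether the processing core should be the same as the IRQ core or a neighbour sharing cache/NUMA node is exactly the trade-off debated above, so it is worth benchmarking both placements.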
Controlling interrupt affinity has quite a few useful applications. A few that spring to mind:
isolating cores - preventing I/O load from spreading onto mission-critical CPU cores, or cores reserved exclusively for RT-priority work (scheduled softirqs could starve on those)
increasing system performance - throughput and latency - by keeping interrupts on the relevant NUMA node on multi-CPU systems
CPU efficiency - e.g. dedicating a CPU and a NIC channel with its interrupt to a single application will squeeze more from that CPU, about 30% more in heavy-traffic use cases.
This might require flow steering to make incoming traffic target that channel.
Note: without affinity the app might seem to deliver more throughput, but it would spill load onto many cores. The gain comes from cache locality and avoided context switches.
latency - setting up the application as above will halve the latency (from 8us to 4us on a modern system)

Windows Sleep Microseconds for Low Latency Thread Communication

Is there any way I could suspend a thread for 1-2 microseconds? (Note: 1 microsecond = 1/1000 millisecond.) I know this question has been asked before, but so far there is no good answer for it. Here is my situation and why I need this:
I have a server with several threads running to serve clients' requests. Requests are sent by another thread through shared memory, which only happens occasionally. However, once a request is sent it needs to be processed ASAP. So, I need a fast way to notify one of the serving threads when a request is queued in shared memory.
A Windows event object was used, but it takes 6-7us from SetEvent() to the WaitForSingleObject() return on my machine (I tried setting process/thread priority, but still not much improvement). I tried using a busy loop to let the serving threads keep polling the memory, which lowers the latency to 1-2us - good enough - but it burns the CPU while requests only arrive about once per minute. If I could insert a micro/nanosecond sleep into the loop, I could at least keep my CPU free while keeping the latency low.
I would be glad if anyone could suggest another way to do the thread communication with latency lower than 2us. Thanks
I think you cannot sleep a thread for that period.
Last time I tested (about 10 years ago, mind, on Windows 2000) sleeping for less than 13ms resulted in a sleep of either 0ms or 13ms.
Windows is not a real-time operating system.
You can spin your thread for that time period, using the high resolution timer.
Honestly though it almost certainly means the design is flawed.
I've just read the post now; the problem is the thread model. You have a worker pool. This inherently requires cross-thread communication, which is problematic since it requires OS services. An alternative is one thread per core, with symmetry between those threads, e.g. they all do everything and you share the load between them.
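A minimal sketch of the spin-on-the-high-resolution-timer idea mentioned above (illustrative only; the spinning core stays busy the whole time):

    #include <windows.h>

    /* Busy-wait for roughly 'micros' microseconds using QueryPerformanceCounter. */
    static void spin_wait_us(LONGLONG micros)
    {
        LARGE_INTEGER freq, start, now;
        QueryPerformanceFrequency(&freq);   /* ticks per second */
        QueryPerformanceCounter(&start);
        LONGLONG target = start.QuadPart + (freq.QuadPart * micros) / 1000000;
        do {
            YieldProcessor();               /* PAUSE hint to the core */
            QueryPerformanceCounter(&now);
        } while (now.QuadPart < target);
    }

This keeps the sub-microsecond reaction time of a busy loop, but it does nothing about the CPU burn; it only makes the spinning slightly kinder to a hyper-threaded sibling.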

I/O completion port's advantages and disadvantages

Why do many people say I/O completion port is a fast and nice model?
What are the I/O completion port's advantages and disadvantages?
I want to know some points which make the I/O completion port faster than other approaches.
If you can explain it comparing to other models (select, epoll, traditional multithread/multiprocess), it would be better.
I/O completion ports are awesome. There's no better word to describe them. If anything in Windows was done right, it's completion ports.
You can create some number of threads (does not really matter how many) and make them all block on one completion port until an event (either one you post manually, or an event from a timer or asynchronous I/O, or whatever) arrives. Then the completion port will wake one thread to handle the event, up to the limit that you specified. If you didn't specify anything, it will assume "up to number of CPU cores", which is really nice.
If there are already more threads active than the maximum limit, it will wait until one of them is done and then hand the event to the thread as soon as it goes to wait state. Also, it will always wake threads in a LIFO order, so chances are that caches are still warm.
In other words, completion ports are a no-fuss "poll for events" as well as "fill CPU as much as you can" solution.
You can throw file reads and writes at a completion port, sockets, or anything else that's waitable. And, you can post your own events if you want. Each custom event has at least one integer and one pointer worth of data (if you use the default structure), but you are not really limited to that as the system will happily accept any other structure too.
Also, completion ports are fast, really really fast. Once upon a time, I needed to notify one thread from another. As it happened, that thread already had a completion port for file I/O, but it didn't pump messages. So, I wondered if I should just bite the bullet and use the completion port for simplicity, even though posting a thread message would obviously be much more efficient. I was undecided, so I benchmarked. Surprise, it turned out completion ports were about 3 times faster. So... faster and more flexible, the decision was not hard.
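For reference, a bare-bones sketch of the pattern described above: create a port, block a worker on it, and post a custom event from another thread (purely illustrative, with error handling omitted):

    #include <windows.h>
    #include <stdio.h>

    static DWORD WINAPI worker(LPVOID arg)
    {
        HANDLE port = (HANDLE)arg;
        DWORD bytes;
        ULONG_PTR key;
        OVERLAPPED *ov;
        for (;;) {
            /* Blocks until an I/O completion or a posted event arrives. */
            if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE))
                break;
            printf("event: key=%llu bytes=%lu\n", (unsigned long long)key, (unsigned long)bytes);
        }
        return 0;
    }

    int main(void)
    {
        /* Concurrency value 0 means "limit active threads to the CPU core count". */
        HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        HANDLE t = CreateThread(NULL, 0, worker, port, 0, NULL);

        /* A custom event: one integer and one pointer-sized key of your own data. */
        PostQueuedCompletionStatus(port, 42, (ULONG_PTR)1, NULL);

        Sleep(100);              /* let the worker print before exiting */
        CloseHandle(t);
        CloseHandle(port);
        return 0;
    }

Associating a file or socket handle with the same port (another CreateIoCompletionPort call) makes its overlapped I/O completions arrive through the same GetQueuedCompletionStatus loop.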
By using IOCP, we can overcome the "one thread per client" problem. It is commonly known that performance decreases heavily if the software does not run on a true multiprocessor machine. Threads are system resources that are neither unlimited nor cheap.
IOCP provides a way to have a few (I/O worker) threads handle multiple clients' input/output "fairly". The threads are suspended and don't use CPU cycles until there is something to do.
Also you can read some information in this nice book http://www.amazon.com/Windows-System-Programming-Johnson-Hart/dp/0321256190
I/O completion ports are provided by the O/S as an asynchronous I/O operation, which means that it occurs in the background (usually in hardware). The system does not waste any resources (e.g. threads) waiting for the I/O to complete. When the I/O is complete, the hardware sends an interrupt to the O/S, which then wakes up the relevant process/thread to handle the result. WRONG: IOCP does NOT require hardware support (see comments below)
Typically a single thread can wait on a large number of I/O completions while taking up very little resources when the I/O has not returned.
Other async models that are not based on I/O completion ports usually employ a thread pool and have threads wait for I/O to complete, thereby using more system resources.
The flip side is that I/O completion ports usually require hardware support, and so they are not generally applicable to all async scenarios.

Windows, multiple process vs multiple threads

We have to make our system highly scalable; it has been developed for the Windows platform using VC++. Say, initially, we would like to process 100 requests (from MSMQ) simultaneously. What would be the best approach: a single process with 100 threads, or 2 processes with 50 threads each? What is the gain, apart from process memory, in the second approach? In Windows, is CPU time first allocated to the process and then split between that process's threads, or does the OS count the number of threads for each process and allocate CPU on the basis of threads rather than processes? We notice that in the first case CPU utilization is 15-25%, and we want to consume more CPU. Remember that we would like to get optimal performance, so 100 requests is just an example. We have also noticed that if we increase the number of threads in the process above 120, performance degrades due to context switches.
One more point: our product already supports clustering, but we want to utilize more CPU on a single node.
Any suggestions will be highly appreciated.
You can't process more requests than you have CPU cores. "Fast" scalable solutions involve setting up thread pools, where the number of active (not blocked on I/O) threads == the number of CPU cores. So creating 100 threads because you want to service 100 MSMQ requests is not good design.
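A tiny sketch of that sizing rule (the worker function is hypothetical; the point is only that the pool size comes from the core count, not the request count):

    #include <windows.h>

    static DWORD WINAPI worker(LPVOID arg)
    {
        (void)arg;
        /* ... pull requests from a shared queue and process them ... */
        return 0;
    }

    int main(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);      /* dwNumberOfProcessors = logical core count */
        for (DWORD i = 0; i < si.dwNumberOfProcessors; i++)
            CreateThread(NULL, 0, worker, NULL, 0, NULL);
        Sleep(INFINITE);         /* keep the process alive for the sketch */
        return 0;
    }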
Windows has a thread pooling mechanism called IO Completion Ports.
Using IO Completion ports does push the design to a single process as, in a multi process design, each process would have its own IO Completion Port thread pool that it would manage independently and hence you could get a lot more threads contending for CPU cores.
The "core" idea of an IO Completion Port is that its a kernel mode queue - you can manually post events to the queue, or get asynchronous IO completions posted to it automatically by associating file (file, socket, pipe) handles with the port.
On the other side, the IO Completion Port mechanism automatically dequeues events onto waiting worker threads - but it does NOT dequeue jobs if it detects that the current "active" threads in the thread pool >= the number of CPU cores.
Using IO Completion Ports can potentially increase the scalability of a service a lot; usually, however, the gain is a lot smaller than expected, as other factors quickly come into play once all the CPU cores are contending for the service's other resources.
If your services are developed in c++, you might find that serialized access to the heap is a big performance minus - although Windows version 6.1 seems to have implemented a low contention heap so this might be less of an issue.
To summarize: theoretically, your biggest performance gains would come from a design using thread pools managed in a single process. But you are heavily dependent on the libraries you use not serializing access to critical resources, which can quickly lose you all of the theoretical performance gains.
If you do have library code serializing your nicely thread-pooled service (as in the case of C++ object creation and destruction being serialized because of heap contention), then you need to change your use of the library, switch to a low-contention version of the library, or just scale out to multiple processes.
The only way to know is to write test cases that stress the server in various ways and measure the results.
The standard approach on Windows is multiple threads. That's not to say it is always your best solution, but there is a price to be paid for each thread or process, and on Windows a process is more expensive. As for the scheduler, I'm not sure, but you can set the priority of the process and of the threads. The real benefit of threads is their shared address space and the ability to communicate without IPC; however, synchronization must be carefully maintained.
If your system is already developed, which it appears to be, it is likely easier to implement a multiple-process solution, especially if there is a chance that more than one machine may be utilized later, since IPC between 2 processes on one machine can, in the general case, scale to multiple machines. Most attempts at massive parallelization fail because the entire system is not evaluated for bottlenecks. For example, if you implement 100 threads that all write to the same database, you may gain little in actual performance and just end up waiting on your database.
just my .02
