Is it a good practice to set interrupt affinity and io handling thread affinity to the same core? - performance

I am trying to understand irq affinity and its impact to system performance.
I went through why-interrupt-affinity-with-multiple-cores-is-not-such-a-good-thing, and learnt that NIC IRQ affinity should be set to a core other than the one handling the network data, so that the handling is not interrupted by incoming IRQs.
I doubt this: if we handle the data on a different core than the IRQ core, we will get more cache misses when retrieving the network data from the kernel. So I actually believe that setting the IRQ affinity to the same core as the thread handling the incoming data will improve performance due to fewer cache misses.
I am trying to come up with some verification code, but before I present any results, am I missing something?

IRQ affinity is a double-edged sword. In my experience it can improve performance, but only in a very specific configuration with a pre-defined workload. As far as your question is concerned (considering only the RX path), typically when a NIC interrupts one of the cores, in the majority of cases the interrupt handler does not do much except trigger a mechanism (bottom half, tasklet, kernel thread, or networking-stack thread) to process the incoming packet in some other context. If the same packet-processing core also handles the interrupts (even though the ISR does not do much), it is bound to lose some cache benefit due to context switches and may see more cache misses. How much impact this has depends on a variety of other factors.
In NIC drivers, typically, each RX queue is given its own core affinity [spreading RX queue processing across different cores], which provides more performance benefit.
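As a concrete illustration (a minimal sketch, not taken from either answer): on Linux, an IRQ's affinity can be changed by writing a hex CPU bitmask to /proc/irq/<N>/smp_affinity. The IRQ number and mask below are arbitrary examples.

```c
/* Minimal sketch: steer IRQ `irq` to the CPUs in `mask` by writing a hex
 * bitmask to /proc/irq/<irq>/smp_affinity. Requires root; the IRQ number
 * used in main() is only a placeholder. */
#include <stdio.h>
#include <stdlib.h>

static int set_irq_affinity(int irq, unsigned long mask)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return -1;
    }
    fprintf(f, "%lx\n", mask);        /* hex bitmask, e.g. "4" means CPU 2 */
    return fclose(f) == 0 ? 0 : -1;
}

int main(void)
{
    /* Example only: route IRQ 30 to CPU 2 (bitmask 0x4). */
    if (set_irq_affinity(30, 1UL << 2) != 0) {
        fprintf(stderr, "failed to set IRQ affinity (are you root?)\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
```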

Controlling interrupt affinity can have quite a few useful applications. A few that spring to mind:
isolating cores - preventing I/O load from spreading onto mission-critical CPU cores, or cores reserved exclusively for RT-priority tasks (scheduled softirqs could starve on those)
increasing system performance - throughput and latency - by keeping interrupts on the relevant NUMA node in multi-CPU systems
CPU efficiency - e.g. dedicating a CPU and a NIC channel with its interrupt to a single application will squeeze more out of that CPU, about 30% more in heavy-traffic use cases (see the pinning sketch after this list).
This might require flow steering to make incoming traffic target the channel.
Note: without affinity the app might seem to deliver more throughput, but it would spill load onto many cores. The gains come from cache locality and avoiding context switches.
latency - setting up the application as above can halve latency (from 8 µs to 4 µs on a modern system)
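To illustrate the dedicated-CPU point above, here is a minimal sketch (assuming Linux with glibc; the core number is arbitrary) of pinning the handling thread to one core with pthread_setaffinity_np; the NIC channel's IRQ would then be steered to the same or a neighbouring core via /proc/irq/<N>/smp_affinity as shown earlier.

```c
/* Minimal sketch: pin the calling thread to a single CPU core using
 * pthread_setaffinity_np (glibc extension). Core 2 is an arbitrary
 * example; in practice it would match (or neighbour) the core the NIC
 * channel's IRQ is steered to. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int err = pin_self_to_core(2);            /* example: core 2 */
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
        return 1;
    }
    /* ... run the packet-handling loop on this pinned core ... */
    return 0;
}
```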

Related

Fastest way for one core to signal another?

On an Intel CPU, I want CPU core A to signal CPU core B when A has completed an event. There are a couple of ways to do this:
A sends an interrupt to B.
A writes a cache line (e.g., a bit flip) and B polls the cache line.
I want B to learn about the event with the least amount of overhead possible. Note that I am referring to overhead, not end-to-end latency. It's alright if B takes a while to learn about the event (e.g., periodic polling is fine), but B should waste as few cycles as possible detecting the event.
Option 1 above has too much overhead due to the interrupt handler. Option 2 is better, but I am still unhappy with the amount of time that B must wait for the cache line to transfer from A's L1 cache to its own L1 cache.
Is there some way A can directly push the cache line into B's L1 cache? It's fine if there is additional overhead for A in this case. I'm not sure if there is some trick I can try where A marks the page as uncacheable and B marks the page as write-back...
Alternatively, is there some other mechanism built into Intel processors that can help with this?
I assume this is less of an issue on AMD CPUs as they use the MOESI coherence protocol, so the "O" should presumably allow A to broadcast the cache line changes to B.
There's disappointingly little you can do about this on x86 without some very recent ISA extensions, like cldemote (Tremont or Alder Lake / Sapphire Rapids) or user-space IPI (inter-processor interrupts) in Sapphire Rapids, and maybe also Alder Lake. (See Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions? for details on UIPI.)
Without any of those features, the choice between occasional polling (or monitor/mwait if the other core has nothing to do) vs. interrupt depends on how many times you expect to poll before you want to send a notification. (And how much potential throughput you'll lose due to any knock-on effects from the other thread not noticing the flag update soon, e.g. if that means larger buffers leading to more cache misses.)
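To make the polling option concrete, here is a hedged sketch (not from the answer; the names are illustrative) of a shared flag that core A sets and core B polls, using C11 atomics, with the x86 pause hint via _mm_pause to make the spin loop cheaper.

```c
/* Minimal sketch of option 2: core A publishes an event by storing to a
 * shared flag; core B polls the flag. C11 release/acquire ordering makes
 * A's prior writes visible to B once it sees the flag; _mm_pause() is an
 * x86 hint that reduces the cost of the spin loop. */
#include <stdatomic.h>
#include <stdbool.h>
#include <immintrin.h>   /* _mm_pause (x86 only) */

/* Give the flag its own cache line to avoid false sharing. */
static _Alignas(64) atomic_bool event_ready;

/* Core A: signal that the event has completed. */
void signal_event(void)
{
    atomic_store_explicit(&event_ready, true, memory_order_release);
}

/* Core B: poll until the event is visible, then reset the flag. */
void wait_for_event(void)
{
    while (!atomic_load_explicit(&event_ready, memory_order_acquire))
        _mm_pause();     /* still burns B's core while spinning */
    atomic_store_explicit(&event_ready, false, memory_order_relaxed);
}
```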
In user-space, other than shared memory or UIPI, the alternatives are OS-delivered inter-process-communications like a signal or a pipe write or eventfd; the Linux UIPI benchmarks compared it to various mechanisms for latency and throughput IIRC.
AMD CPUs don't broadcast stores; that would swamp the interconnect with traffic and defeat the benefit of private L1d cache for lines that get repeatedly written (between accesses from other cores, even if it avoided it for lines that weren't recently shared.)

The effects of heavy thread consumption on ARM (4-core A72) vs x86 (2-core i5)

I have a realtime Linux desktop application (written in C) that we are porting to ARM (a 4-core ARMv8-A Cortex-A72 CPU). Architecturally, it has a combination of high-priority explicit pthreads (6 of them) and a couple of GCD (libdispatch) worker queues (one concurrent and another serial).
My concerns come in two areas:
I have heard that ARM does not hyperthread the way that x86 can and therefore my 4-cores will already be context switching to keep up with my 6 pthreads (and background processes). What kind of performance penalty should I expect from this?
I have heard that I should expect these ARM context-switches to be less efficient than x86. Is that true?
A couple of the pthreads are high-priority handlers for fairly rare events (they sit in a select call); does this change the prospects much?
My bigger concern comes from the impact of GCD in this application. My understanding of the inner workings of GCD is that it is a dynamically scaled thread pool that interacts with the scheduler and will try to add more threads to suit the load. It sounds to me like this will have an almost exclusively negative impact on performance in my scenario (i.e., in a system whose cores are fully consumed). Correct?
I'm not an expert on anything x86-architecture related (so hopefully someone more experienced can chime in) but here are a few high level responses to your questions.
I have heard that ARM does not hyperthread the way that x86 can [...]
Correct: Hyper-Threading is Intel's proprietary name for its simultaneous multithreading (SMT) design, and the Cortex-A72 does not implement SMT or anything analogous that I am aware of.
[...] and therefore my 4-cores will already be context switching to keep up with my 6 pthreads (and background processes). What kind of performance penalty should I expect from this? [...]
This is not necessarily the case, although it could very well happen in many scenarios. It really depends on the nature of your per-thread computations: are you just doing lots of hefty computations, or are you doing a lot of blocking/waiting on I/O? Either way, this degradation will happen on both architectures and it is more of a general thread-scheduling problem. In the hyperthreaded Intel world, each "physical core" is seen by the OS as two "logical cores" which share the same execution resources but have their own architectural register state. The Wikipedia article states:
Each logical processor can be individually halted, interrupted or directed to execute a specified thread, independently from the other logical processor sharing the same physical core.[7]
Unlike a traditional dual-processor configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical core (assuming both logical cores are associated with the same physical core). A processor stalls when it is waiting for data it has sent for so it can finish processing the present thread. The degree of benefit seen when using a hyper-threaded or multi core processor depends on the needs of the software, and how well it and the operating system are written to manage the processor efficiently.[7]
So if a few of your threads are constantly blocking on I/O, then this might be where you would see more improvement in a 6-thread application on a 4-physical-core system (for both ARM and Intel x86), since theoretically this is where hyperthreading would shine... a thread blocking on I/O or on the result of another thread can "sleep" while still allowing the other thread running on the same core to do work, without the full overhead of a thread switch (experts please chime in and tell me if I'm wrong here).
But 4-core ARM vs 2-core x86... assuming all else is equal (which obviously is not the case; in reality clock speeds, cache hierarchy etc. all have a huge impact), I think it really depends on the nature of the threads. I would imagine a drop in performance could occur if you are just doing a ton of purely CPU-bound computations (i.e. the threads never need to wait on anything external to the CPU). But if you are doing a lot of blocking I/O in each thread, you might see significant speedups running up to probably 3 or 4 threads per logical core.
Another thing to keep in mind is the cache. When doing lots of CPU-bound computations, a thread switch can blow away the cache, resulting in much slower memory access initially. This will happen on both architectures, and it is less of an issue for I/O-bound threads. But if you are not doing a lot of blocking work, then the extra overhead of threading will just make things slower for the reasons above.
I have heard that I should expect these ARM context-switches to be less efficient than x86. Is that true?
A hardware context switch is a hardware context switch: you push all the registers to the stack and flip some bits to change execution state. So no, I don't believe either is "faster" in that regard. However, for a single physical core, techniques like hyperthreading make a "context switch" in the operating-system sense (I think you mean switching between threads) much faster, since the instructions of both programs were already being executed in parallel on the same core.
I don't know anything about GCD so can't comment on that.
At the end of the day, I would say your best shot is to benchmark the application on both architectures. See where your bottlenecks are. Are they in memory access? Then keeping the cache hot is a priority. I imagine that one thread per core would always be optimal for any scenario, if you can swing it.
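Building on the one-thread-per-core rule of thumb, here is a minimal sketch (not part of the original answer; the worker body is a placeholder) that sizes a pthread worker pool to the number of online cores reported by sysconf:

```c
/* Minimal sketch: create one worker thread per online core, following
 * the one-thread-per-core rule of thumb. The worker body is a stub. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    /* ... CPU-bound work for this worker ... */
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncores < 1)
        ncores = 1;

    pthread_t *tids = malloc(sizeof(*tids) * (size_t)ncores);
    for (long i = 0; i < ncores; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncores; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return 0;
}
```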
Some good things to read on this matter:
https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
https://lwn.net/Articles/250967/
Optimal number of threads per core
Thread context switch Vs. process context switch

What's the process of disabling interrupt in multi-processor system?

I have a textbook statement that says disabling interrupts is not recommended in multi-processor systems, and that it takes too much time. I don't understand this; can anyone show me the process by which a multi-processor system would disable interrupts? Thanks.
On x86 (and other architectures, AFAIK), enabling/disabling interrupts is per-core. You can't globally disable interrupts on all cores.
Software can communicate between cores with inter-processor interrupts (IPIs) or atomic shared variables, but even so it would be massively expensive to arrange for all cores to sit in a spin-loop waiting for a notification from this core that they can re-enable interrupts. (Interrupts are disabled on the other cores, so you can't send them an IPI to let them know when you're done with your block of atomic operations.) You have to interrupt whatever all 7 other cores (e.g. on an 8-way SMP system) are doing, with many cycles of round-trip communication overhead.
It's basically ridiculous. It would be clearer to just say you can't globally disable interrupts across all cores, and that it wouldn't help anyway for anything other than interrupt handlers. It's theoretically possible, but it's not just "slow", it's impractical.
Disabling interrupts on one core doesn't make something atomic if other threads are running on other cores. Disabling interrupts works on uniprocessor machines because it makes a context-switch impossible. (Or it makes it impossible for the same interrupt handler to interrupt itself.)
But I think my confusion is that the difference between 1 core and 8 cores doesn't seem like a big number to me; why is disabling interrupts on all of them so time consuming?
Anything other than uniprocessor is a fundamental qualitative difference, not quantitative. Even a dual-core system, like early multi-socket x86 and the first dual-core-in-one-socket x86 systems, completely changes your approach to atomicity. You need to actually take a lock or something instead of just disabling interrupts. (Early Linux, for example, had a "big kernel lock" that a lot of things depended on, before it had fine-grained locking for separate things that didn't conflict with each other.)
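To illustrate "take a lock or something instead of just disabling interrupts", here is a minimal user-space sketch (illustrative only; a real kernel uses its own spinlock primitives and may additionally disable interrupts locally around them) of mutual exclusion built from a C11 atomic flag:

```c
/* Minimal sketch of SMP-safe mutual exclusion: a spinlock built from a
 * C11 atomic flag. Unlike disabling interrupts on one core, this also
 * excludes code running concurrently on other cores.
 * (Illustrative only; kernels have their own spinlock primitives.) */
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    /* Test-and-set until we win the lock; acquire ordering keeps the
     * critical section from being reordered before lock acquisition. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static long shared_counter;

void increment_shared(void)
{
    spin_lock();
    shared_counter++;   /* safe w.r.t. all cores, not just this one */
    spin_unlock();
}
```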
The fundamental difference is that on a UP system, only interrupts on the current CPU can cause things to happen asynchronously to what the current code is doing. (Or DMA from devices...)
On an SMP system, other cores can be doing their own thing simultaneously.
For multithreading, getting atomicity for a block of instructions by disabling interrupts on the current CPU is completely ineffective; threads could be running on other CPUs.
For atomicity of something in an interrupt handler, if this IRQ is set up to only ever interrupt this core, disabling interrupts on this core will work. Because there's no threat of interference from other cores.

Processor pipeline state preservation

Is there any situation where the state of the processor pipeline (with already decoded or prefetched instructions) is saved and subsequently reloaded when execution resumes after a thread sleep / context switch / interrupt, etc.? (Maybe as an optimization.)
This isn't possible for any CPU I'm aware of. There's no interface for doing it, and no conditions under which a CPU does it on its own. Dumping a huge amount of internal CPU state to RAM would take more cycles than it would save. Having the OS keep track of the variable-size chunks of RAM needed for this would just make the overhead worse.
If anything was worth saving, BTW, it would be the results of already-executed instructions that can't retire yet because of a load that missed in cache. (All the common out-of-order execution designs for mainstream ISAs use in-order retirement to support precise exceptions. Out-of-order retirement with checkpointing / rollback on exceptions and mispredicts has been proposed. Search for kilo-instruction processor, IIRC.)
(Flawed idea): an aggressive out-of-order design could avoid wasting too much work on context switches by delaying the write of the interrupt-return address when an external interrupt arrives, i.e. it could pretend that the interrupt came in later than it did by allowing some instructions already in the pipeline to keep executing. If the user-space instruction pointer isn't needed until the interrupt handler returns, the CPU could delay clearing the pipeline.
Hmm, this has the major difficulty that register values on entry into the interrupt handler also depend on the architectural state, so this probably can't work.
This definitely can't work for interrupts generated by user-space, because that fixes the return address.
This isn't an issue for threads that put themselves to sleep while waiting on a spinlock with monitor / mwait or something. mwait presumably doesn't take effect until it retires, and it won't retire until all previous work has been done. It would defeat the intended purpose for the CPU to be aggressive about speculatively executing past mwait, I think. Or maybe mwait doesn't even flush the pipeline, and just saves power.
The idea has been proposed, but you'd need a much denser memory technology which is only now becoming available. See this paper for example:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6489970
Basically, they propose pipelines composed of a new set of latches & registers based on memristors (resistive non-volatile memory components) that can hold multiple values corresponding to multiple threads. Control logic can then tell all latches which thread should be active, allowing a simultaneous context switch throughout the entire pipeline.
Keep in mind that this only enhances the granularity to the latch level. Modern CPUs with simultaneous multithreading can already have different threads active on different units without context switches, simply through arbitration. Units with inherent parallelism may already handle multiple threads per cycle (e.g. multi-ported ALUs).

CPU Utilization and threads

We have a transaction-intensive process at one customer site running on a server with four quad-core processors (16 cores). The process is designed to take advantage of every core available, so in this installation we take an input queue, divide it into sixteenths, and allocate each fraction of the queue to a core. It works well and keeps up with the transaction volume on the box.
Looking at the CPU utilization on the box, it never seems to go above 33%. Now we have a new customer with at least double the volume of the existing customer. Some of us are arguing that since CPU usage is way below maximum utilization, that we should go with the same configuration.
Others claim that there is no direct correlation between cpu utilization and transaction processing speed and since the logic of the underlying software module is based on the number of available cores, that it makes sense to obtain a box with proportionately more cores available for the new client to accommodate the increased traffic volume.
Does anyone have a sense as to who is right in this instance?
Thank you,
To determine the optimum configuration for your new customer, understanding the reason for low CPU usage is paramount.
Very likely, the reason is one of the following:
Your process is limited by memory bandwidth. In this case, faster RAM will help if supported by the motherboard. If possible, a redesign to limit the amount of data accessed during processing will improve performance. Adding more CPU cores will, on its own, do nothing to improve performance.
Your process is limited by disk I/O. Using faster disk connections (SATA etc.) and/or upgrading to an SSD might help, but more CPU power will not.
Your process is limited by synchronization contention. In this case, adding more threads for more cores might even be counterproductive. Redesigning your algorithm might help here.
Having said this, I have also seen situations where processes that are definitely CPU-bound fail to show 100% CPU usage on modern processors (Core i7 etc.), because in certain Turbo Boost-related cases Task Manager will show less than 100%.
As 9000 said, you need to find out what your bottlenecks are when under load. Perfmon might provide enough data to find out.
Another afterthought: you could limit your process on the existing machine to a subset of the cores (but still at least a third of them, so that in theory the CPU does not become a bottleneck due to this limitation) and check whether overall throughput degrades. If it does not, adding more cores will not improve performance.
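If you'd rather do that experiment from inside the process than with an external tool, here is a hedged sketch (the core range is an arbitrary example) using Linux's sched_setaffinity to restrict the whole process to a subset of cores:

```c
/* Minimal sketch: restrict the current process to a subset of cores
 * (cores 0-4 here, an arbitrary example) to test whether throughput
 * degrades when fewer cores are available. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int core = 0; core < 5; core++)      /* allow only cores 0..4 */
        CPU_SET(core, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = self */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run the transaction-processing workload and compare throughput ... */
    return 0;
}
```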