Difference between SoftIRQs and Tasklets - linux-kernel

While studying Linux interrupt handling I found that Tasklets and SoftIRQs are two different methods of performing "bottom half" (deferred, lower-priority) work. I understand the need for such deferral (it is quite genuine).
The difference is that SoftIRQs are re-entrant while a Tasklet is not: the same SoftIRQ can run on different CPUs simultaneously, which is not the case with Tasklets.
Though I understand this at the surface level, I fail to understand the need for two separate features. In what case(s) should we use each facility? How do I recognize when to use Tasklets and when to use SoftIRQs?
Also, what do we mean when we say Tasklets are built upon SoftIRQs? In one of the books I read that there were debates on LKML about removing Tasklets. I got completely confused: why would one bring in such a feature at all? Some shortsightedness (no offense meant)?
Any pointers on this will help a lot.

include/linux/interrupt.h
/* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
   frequency threaded job scheduling. For almost all the purposes
   tasklets are more than enough. F.e. all serial device BHs et
   al. should be converted to tasklets, not to softirqs.
 */
enum
{
        HI_SOFTIRQ=0,   /* High Priority */
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        BLOCK_SOFTIRQ,
        BLOCK_IOPOLL_SOFTIRQ,
        TASKLET_SOFTIRQ,
        SCHED_SOFTIRQ,
        HRTIMER_SOFTIRQ,
        RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */

        NR_SOFTIRQS
};
The key differences between softirq and tasklet are:
Allocation
Softirqs are statically allocated at compile-time. Unlike tasklets, you cannot dynamically register and destroy softirqs.
Tasklets can be statically allocated using DECLARE_TASKLET(name, func, data), or can be allocated dynamically and initialized at runtime using tasklet_init(t, func, data).
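For example, a minimal sketch of both allocation styles, using the classic tasklet API (pre-5.9 kernels, where the handler takes an unsigned long); all names here are made up:

#include <linux/init.h>
#include <linux/interrupt.h>

static void my_tasklet_fn(unsigned long data)
{
        /* deferred work runs here, in softirq context */
}

/* static allocation, fixed at compile time */
static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

/* dynamic allocation, initialized at runtime */
static struct tasklet_struct my_dyn_tasklet;

static int __init my_driver_init(void)
{
        tasklet_init(&my_dyn_tasklet, my_tasklet_fn, 0);
        return 0;
}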
Concurrency
Softirqs can run concurrently on several CPUs, even two instances of the same type; softirq handlers are therefore reentrant functions and must explicitly protect their data structures with spinlocks.
Tasklets are non-reentrant and tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time. However, tasklets of different types can be executed concurrently on several CPUs.
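To illustrate the locking consequence, here is a hypothetical softirq action (names made up; it would have to be registered with open_softirq() against a statically allocated entry in the enum above). It must take a real lock around shared data, because another CPU may be running the very same action at the same instant:

#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/spinlock.h>

static LIST_HEAD(pending_work);
static DEFINE_SPINLOCK(pending_lock);

static void my_softirq_action(struct softirq_action *a)
{
        /* The same action may be running on another CPU right now,
         * so disabling interrupts locally would not be enough. */
        spin_lock(&pending_lock);
        /* ... consume entries from pending_work ... */
        spin_unlock(&pending_lock);
}

A tasklet function touching only its own data would need no such lock, since the same tasklet never runs on two CPUs at once.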
Processing
Softirqs are activated by means of raise_softirq(). The pending softirqs are processed by do_softirq() and by the ksoftirqd kernel thread, or when softirqs are re-enabled by local_bh_enable() or spin_unlock_bh().
Tasklets are a bottom-half mechanism built on top of softirqs, i.e. tasklets are represented by two softirqs: HI_SOFTIRQ and TASKLET_SOFTIRQ. Tasklets are actually run from a softirq. The only real difference between these types is that HI_SOFTIRQ-based tasklets run before the TASKLET_SOFTIRQ ones. So tasklet_schedule() basically calls raise_softirq(TASKLET_SOFTIRQ).
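A simplified sketch of that path, modeled on kernel/softirq.c (details vary across kernel versions, so treat this as an approximation rather than the verbatim source; tasklet_vec is the kernel's per-CPU tasklet list):

static inline void tasklet_schedule(struct tasklet_struct *t)
{
        /* The SCHED bit makes scheduling idempotent: raising the same
         * tasklet several times before it runs yields one invocation. */
        if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
                __tasklet_schedule(t);
}

void __tasklet_schedule(struct tasklet_struct *t)
{
        unsigned long flags;

        local_irq_save(flags);
        /* append to this CPU's tasklet list ... */
        t->next = NULL;
        *__this_cpu_read(tasklet_vec.tail) = t;
        __this_cpu_write(tasklet_vec.tail, &(t->next));
        /* ... and let the softirq machinery run it */
        raise_softirq_irqoff(TASKLET_SOFTIRQ);
        local_irq_restore(flags);
}

The test_and_set_bit() also explains why several hardware interrupts posting the same tasklet can collapse into a single invocation.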
Note that softirqs (and hence tasklets and timers) are run on return from hardware interrupts, or on return from a system call. Also, as soon as the thread that raised the softirq ends, that single softirq (and no other) is run, to minimize softirq latency.

Softirqs are re-entrant, that is, different CPUs can take the same softirq and execute it, while tasklets are serialized: the CPU that is running a tasklet has the exclusive right to complete it, and no other CPU can take it over (even across scheduling).
Refer to this excellent article.
Also, you can disable deferred (bottom-half) processing on the local CPU with local_bh_disable(), which makes the per-CPU __local_bh_count non-zero, and re-enable it with local_bh_enable().
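A minimal sketch of that pairing (names made up), protecting data that a tasklet also touches from code running in process context:

#include <linux/interrupt.h>

static int shared_counter;      /* also updated by a tasklet */

static void update_from_process_context(void)
{
        local_bh_disable();     /* no bottom halves on this CPU now */
        shared_counter++;       /* safe against the local tasklet */
        local_bh_enable();      /* re-enable; pending softirqs may run */
}

Note this only excludes bottom halves on the local CPU; if the tasklet can be scheduled on another CPU, you still need a lock as well.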
Also read this book (free to download), page 131, which explains the difference and walks through it with a code example using a fake/dummy device, roller.

Softirqs are statically allocated at compile time. Unlike tasklets, you cannot dynamically register and destroy softirqs. Tasklets are similar to softirqs in how they work, but they have a simpler interface.
Softirqs are required only for very high-frequency and highly threaded uses, whereas tasklets do just fine in any other case.

Tasklets are implemented on top of softirqs, so they are softirqs: they are represented by the two softirqs HI_SOFTIRQ and TASKLET_SOFTIRQ, which differ only in priority.
Even though they are implemented on top of softirqs, they differ in the following ways:
Tasklets can be created/destroyed statically or dynamically, but softirqs can only be allocated statically.
Two different tasklets can run concurrently on different CPUs, but two tasklets of the same type cannot run at the same time anywhere; softirqs have no such restriction.
Softirqs are reserved for the most time-critical and important bottom-half processing on the system.

Related

Program counter, fences and processor re-ordering

I understand that instructions can be re-ordered by the processor in addition to compilers.
I have a few questions that I can not get my head around.
Say we have three instructions:
Program order
S1
S2
S3
After re-ordering by the processor, order becomes (for whatever reason):
S3
S2
S1
So when the processor executes S1 (in the program order), what would be the value of the Program Counter?
If windows (or another OS), context switches the thread out and schedules it in another processor, how would the other processor know which instruction to execute next? (Is it guaranteed to make the same re-orderings?)
Is a memory fence (for example, a full fence created by an atomic compare-and-swap instruction) executed on one processor still valid after the thread is scheduled onto another processor?
Any ideas on this are highly appreciated.
There is an instruction pointer associated with each instruction.
Although instructions may be executed out of order, they always complete in order. When an interrupt or fault occurs, all instructions preceding the saved IP address have been completed. The results of any subsequent instructions are discarded. When execution resumes, it starts at the saved address.
The steps taken by the OS to schedule a thread on another processor include fencing operations on both processors, so when the thread resumes on the new processor, all preceding operations are fully fenced (whether or not any explicit fences exist in the code of the thread).
Unlike static compile-time reordering, out-of-order execution preserves the illusion of running instructions in program order, including the situation seen by an interrupt handler. Current CPUs don't rename the privilege level, so they generally roll back to a consistent state as part of taking an exception or interrupt, rather than keeping not-yet-executed instructions in flight. See: When an interrupt occurs, what happens to instructions in the pipeline?
This also means that interrupts are delivered strictly between instructions, not in the middle of one (see: Interrupting an assembly instruction while it is operating), except for "interruptible" instructions like rep movsb that logically work as multiple instructions, or vpgatherdd, which has documented semantics for a page fault in one of the gather operands.
Memory ordering as observed by other cores is another matter, and can differ from program order even on an in-order CPU. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
The kernel code for a context switch needs to include a strong enough barrier for a thread to see its own stores in program order when it resumes on another core. Generally just release/acquire sync is sufficient (and you already need something like that for the kernel on the other core to restore register values). Maybe also an sfence to make that apply even for NT stores on x86.
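As a rough user-space analogy for that release/acquire handoff (C11 atomics; this illustrates the ordering idea only and is not the kernel's actual context-switch code):

#include <stdatomic.h>

struct saved_state { long regs[16]; };  /* stand-in for switched-out state */

static struct saved_state state;
static atomic_int state_ready;

/* core A: publish the outgoing thread's state */
void switch_out(void)
{
        state.regs[0] = 42;     /* plain stores ... */
        atomic_store_explicit(&state_ready, 1,
                              memory_order_release);  /* ... then release */
}

/* core B: acquire before resuming the thread */
void switch_in(void)
{
        while (!atomic_load_explicit(&state_ready, memory_order_acquire))
                ;       /* spin until the state is published */
        /* all of core A's earlier stores are now guaranteed visible */
        long r0 = state.regs[0];
        (void)r0;
}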

On the Linux kernel, can atomic operations (e.g. atomic_inc, atomic_dec, etc.) protect a variable in a multi-core environment?

Atomic operations protect a variable in a multi-threading environment, but are they also suitable for a multi-core environment?
Yes, they are. They are typically implemented via atomic memory-bus operations and so work just the same in a multi-core scenario.
In fact, if you know the data you are protecting is only accessed by different threads (tasks) on the same core, it is probably cheaper to implement the protection by other means, such as disabling preemption and/or interrupts. Atomic operations are specifically meant for situations where that is not enough, such as multi-core systems.
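A minimal sketch of the API in question, using a hypothetical reference count:

#include <linux/atomic.h>

static atomic_t users = ATOMIC_INIT(0);

/* safe on any core: the increment is one indivisible
 * read-modify-write (e.g. a LOCK-prefixed instruction on x86) */
void get_ref(void)
{
        atomic_inc(&users);
}

/* decrement and test in one atomic step, so two cores dropping
 * the last two references cannot both observe the count hit zero */
void put_ref(void)
{
        if (atomic_dec_and_test(&users))
                ;       /* last reference gone: release the resource */
}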
Multi-threaded essentially means that there are multiple tasks (threads of one or more processes) running concurrently. According to Wikipedia:
Atomicity is a guarantee of isolation from interrupts, signals, concurrent processes and threads.
This is because such an operation is treated as a single, uninterruptible step. Multiple threads can of course perform these operations, but only one at a time, because the hardware completes one atomic operation before another can begin on the same location.
The same logic applies to multi-core processors, where multiple cores try to access the same data. This is achieved through mutual exclusion, which ensures that a critical code block is never executed by more than one of them at the same time; in software terms, this means using locks so that the other processors cannot access the data while it is in use.

Processor pipeline state preservation

Is there any situation where the state of the processor pipeline (with already decoded or prefetched instructions) is saved and subsequently reloaded after resumption during a thread sleep / context switch / interrupt etc.? (Maybe as an optimization.)
This isn't possible for any CPU I'm aware of. There's no interface for doing it, and no conditions under which a CPU does it on its own. Dumping a huge amount of internal CPU state to RAM would take more cycles than it would save. Having the OS keep track of the variable-size chunks of RAM needed for this would just make the overhead worse.
If anything were worth saving, BTW, it would be the results of already-executed instructions that can't retire yet because of a load that missed in cache. (All the common out-of-order execution designs for mainstream ISAs use in-order retirement to support precise exceptions. Out-of-order retirement with checkpointing / rollback on exceptions and mispredicts has been proposed; search for "kilo-instruction processor", IIRC.)
(Flawed idea): an aggressive out-of-order design could avoid wasting too much work on context switches by delaying the write of the interrupt-return address when an external interrupt arrives, i.e. it could pretend that the interrupt came in later than it did by allowing some instructions already in the pipeline to keep executing. If the user-space instruction pointer isn't needed until the interrupt handler returns, the CPU could delay draining the pipeline.
Hrm, this has the major difficulty that the register values on entry into the interrupt handler also depend on the architectural state, so this probably can't work.
This definitely can't work for interrupts generated by user-space, because that fixes the return address.
This isn't an issue for threads that put themselves to sleep while waiting on a spinlock with monitor / mwait or something. mwait presumably doesn't take effect until it retires, and it won't retire until all previous work has been done. It would defeat the intended purpose for the CPU to be aggressive about speculatively executing past mwait, I think. Or maybe mwait doesn't even flush the pipeline, and just saves power.
The idea has been proposed, but you'd need a much denser memory technology which is only now becoming available. See this paper for example:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6489970
Basically, they propose pipes composed of a new set of latches & registers based on memristors (resistive non-volatile memory components), that can hold multiple values corresponding to multiple threads. Control logic can then tell all latches which thread should be active, and allow simultaneous context switching throughout the entire pipe.
Keep in mind that this only refines the granularity down to the latch level. Modern CPUs with simultaneous multithreading can already have different threads active at the level of individual units without context switches, simply through arbitration. Other units with inherent parallelism may already handle multiple threads per cycle (e.g. multi-ported ALUs).

Could a process running only on one processor have threads running on other processors?

Is it possible, in a multiprocessor environment (PC), that one Windows process is configured to run only on one processor (affinity mask = 1, or SetProcessAffinityMask(GetCurrentProcess(), 1)), but its threads are spawned on other processors?
(The question came from a discussion started in one company regarding the use of synchronization objects (Events, Mutexes, Semaphores) and WinAPIs like WaitForSingleObject, etc., especially SignalObjectAndWait, for which MSDN states:
"Note that the "signal" and "wait" are not guaranteed to be performed
as an atomic operation. Threads executing on other processors can
observe the signaled state of the first object before the thread
calling SignalObjectAndWait begins its wait on the second object"
Does it mean that for single processor it's guaranteed to be atomic?
P.S. Are there any differences in Windows context switching between multiple processors and a single processor with several real cores?
P.P.S. Please be patient with this question if I didn't use exact and concrete terms - this area is still not very well known to me.
No.
The set of processor cores a thread can run on is the intersection of the process affinity mask and the thread affinity mask.
To get the behavior you describe, one would set the thread affinity mask for the main thread, and not mess with the process mask.
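A short sketch of that arrangement (error handling omitted; the masks are illustrative):

#include <windows.h>

int main(void)
{
    /* Leave the process mask alone; pin only the current (main)
     * thread to CPU 0. Threads created later keep the full process
     * mask and may run on any processor. */
    SetThreadAffinityMask(GetCurrentThread(), 1);

    /* By contrast, this would constrain every thread in the process,
     * since a thread runs only on the intersection of the two masks:
     *
     *     SetProcessAffinityMask(GetCurrentProcess(), 1);
     */
    return 0;
}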
For your follow-up questions: if it isn't atomic, it isn't atomic. There are additional guarantees for threads sharing a core, because preemption follows certain rules, but they are very complex, since relative priority and dynamic priority are important factors in thread scheduling. Because of the complexity, it is best to use proper synchronization.
Notably, race conditions between threads of equal priority certainly still exist on a single core (or single core restricted) system, but they are far less frequent and therefore far more difficult to find and debug.
Is it possible, in a multiprocessor environment (PC), that one Windows process is configured to run only on one processor (affinity mask = 1, or SetProcessAffinityMask(GetCurrentProcess(), 1)), but its threads are spawned on other processors?
If the CPU affinity is not set to only one core, could one process run on multiple cores?
What's the difference between processes and threads?
Could a thread have processes, or could a process have threads?
Could a process be seen from a thread's point of view, or vice versa?
What is the notion of atomicity?
When could the number 1 be seen as a multidimensional unit?
Could we divide 1 by 0 (by zero)? When could we, and when couldn't we?
Does it mean that for a single processor it's guaranteed to be atomic?
One CPU: do you remember "run and stay resident"? Good old times!
Then Unix: multiprocessing, multithreading, etc. :)
Note:
You couldn't ask a question without knowing the answer to that question.
Try to ask something you don't know - that's impossible! You're asking because you have an answer. Look inside your question; the answer is evident. :)

tasklet advantage in userspace application

I have some doubts about bottom halves. Here I consider tasklets only.
Also, I consider a non-preemptible kernel only.
Suppose we have an Ethernet driver whose RX interrupt processing makes some 10 function calls (bad programming :)).
Now, looking at it from a performance perspective: if 9 of the function calls can be moved to a tasklet and only 1 needs to be made in the interrupt handler, can I really get noticeably better performance in a TCP read application?
Or, in other words: when there is a switch to the user-space application, all 9 function calls of the scheduled tasklets will be made, so in effect the user-space application will be able to get the packet and its data only after all the scheduled tasklets have completed. Correct?
I understand that by having a bottom half we keep all interrupts enabled, but I doubt whether an application that relies on the interrupt actually gains anything from splitting the 10 functions between the interrupt handler and the bottom half, compared to keeping them all in the handler.
In short: by using a tasklet, do I gain a performance improvement in the user-space application here?
Since tasklets are not queued but scheduled, i.e. several hardware interrupts posting the same tasklet might result in a single tasklet function invocation, you would be able to save up to 90% of the processing in extreme cases.
On the other hand there's already a high-priority soft IRQ for net-rx.
In my experience on fast machines, moving work from the handler to the tasklet does not make the machine run faster. I've added macros in the handler that can turn my schedule_tasklet() call into a call to the tasklet function itself, and it's easy to benchmark both ways and see the difference.
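Something along these lines (the macro and all names are hypothetical; the point is that a compile-time switch lets you benchmark both variants):

#include <linux/interrupt.h>

static void rx_tasklet_fn(unsigned long data)
{
        /* deferred RX processing */
}

static DECLARE_TASKLET(rx_tasklet, rx_tasklet_fn, 0);

/* define INLINE_BOTTOM_HALF to run the tasklet body synchronously
 * in the handler instead of deferring it, for benchmarking */
#ifdef INLINE_BOTTOM_HALF
#define RUN_BOTTOM_HALF()  rx_tasklet_fn(0)
#else
#define RUN_BOTTOM_HALF()  tasklet_schedule(&rx_tasklet)
#endif

static irqreturn_t rx_irq_handler(int irq, void *dev_id)
{
        /* ... acknowledge the device ... */
        RUN_BOTTOM_HALF();
        return IRQ_HANDLED;
}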
But it's important that interrupt handlers finish quickly. As Nikolai mentioned, you might benefit if your device likes to interrupt a lot, but most high-bandwidth devices have interrupt mitigation hardware that makes this a less serious problem.
Using tasklets is the way that core kernel people are going to do things, so all else being equal, it's probably best to follow their lead, especially if you ever want to see your driver accepted into the mainline kernel.
I would also note that calling lots of functions isn't necessarily bad practice; modern branch predictors can make branch-heavy code run just as fast as non-branch-heavy code. Far more important in my opinion are the potential cache effects of having to do half the job now, and then half the job later.
A tasklet does not run in the context of the user process. If your ISR schedules a tasklet, it will run immediately after your ISR is done, but with interrupts enabled. The benefit of this is that your packet processing is not preventing additional interrupts.
In your TCP example, the hardware hands off the packet to the network stack and your driver is done - the net stack handles waking up the process, etc., so there is really no way for the hardware's driver to execute in the process context of the data's recipient, because the hardware doesn't even know who that is.
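Putting the question's scenario into a sketch (the 1/9 split and all names are illustrative): the handler makes only the one urgent call, and the other nine calls move into the tasklet, which runs soon afterwards with interrupts re-enabled:

#include <linux/interrupt.h>

static void rx_bottom_half(unsigned long data)
{
        /* the 9 deferrable calls go here: parsing, checksums,
         * handing the packet to the network stack, waking readers,
         * etc. -- interrupts stay enabled while this runs */
}

static DECLARE_TASKLET(rx_tasklet, rx_bottom_half, 0);

static irqreturn_t rx_irq_handler(int irq, void *dev_id)
{
        /* the 1 urgent call: acknowledge the device / refill the DMA
         * ring so the hardware can keep receiving */
        tasklet_schedule(&rx_tasklet);
        return IRQ_HANDLED;
}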
