What is the difference between spin_lock and spin_lock_bh? - linux-kernel

I want to understand the difference between spin_lock and spin_lock_bh.
As per the locking cheat sheet at
https://www.kernel.org/doc/html/v4.15/kernel-hacking/locking.html#cheat-sheet-for-locking
if the critical section is shared between a softirq and a user-context process, we should use spin_lock_bh.
If it is shared between two softirqs, we should use spin_lock.

Related

Program counter, fences and processor re-ordering

I understand that instructions can be re-ordered by the processor in addition to compilers.
I have a few questions that I cannot get my head around.
Say we have three instructions:
Program order
S1
S2
S3
After re-ordering by the processor, order becomes (for whatever reason):
S3
S2
S1
So when the processor executes S1 (in program order), what would be the value of the program counter?
If Windows (or another OS) context-switches the thread out and schedules it on another processor, how would the other processor know which instruction to execute next? (Is it guaranteed to make the same re-orderings?)
Is a memory fence (for example, a full fence created by an atomic compare-and-swap instruction) on one processor still valid after the thread is scheduled on another processor?
Any ideas on this are highly appreciated.
There is an instruction pointer associated with each instruction.
Although instructions may be executed out of order, they always complete in order. When an interrupt or fault occurs, all instructions preceding the saved IP address have been completed. The results of any subsequent instructions are discarded. When execution resumes, it starts at the saved address.
The steps taken by the OS to schedule a thread on another processor include fencing operations on both processors, so when the thread resumes on the new processor, all preceding operations are fully fenced (whether or not any explicit fences exist in the code of the thread).
Unlike static compile-time reordering, out-of-order execution preserves the illusion of running instructions in program order, including as seen by an interrupt handler. Current CPUs don't rename the privilege level, so they generally roll back to a consistent state as part of taking an exception or interrupt, rather than keeping un-executed instructions in flight. (See: When an interrupt occurs, what happens to instructions in the pipeline?)
This also means that interrupts are delivered strictly between instructions, not in the middle of one (see: Interrupting an assembly instruction while it is operating). The exceptions are "interruptible" instructions like rep movsb, which logically work as multiple instructions, and vpgatherdd, which has documented semantics for a page fault in one of the gather operands.
Memory ordering as observed by other cores is another matter, and can differ from program order even on an in-order CPU. (See: Can a speculatively executed CPU branch contain opcodes that access RAM?)
The kernel code for a context switch needs to include a strong enough barrier for a thread to see its own stores in program order when it resumes on another core. Generally just release/acquire sync is sufficient (and you already need something like that for the kernel on the other core to restore register values). Maybe also an sfence to make that apply even for NT stores on x86.

Barrier between memory sections

I'm researching how memory is managed in RTEMS on an ARM-based Xilinx Zynq. The program runs on two cores with SMP.
I have read about memory barriers and the out-of-order execution paradigm, and I concluded that a barrier or fence is implemented in hardware rather than in software.
RAM is divided into several sections; however, there are some sections called barriers that share areas with other sections. I have attached a capture.
xbarrier starts where the next section begins and ends where the previous section ends. Another example:
In this one, the barrier starts at the same address as the previous section and ends before the next section starts.
Are these memory sections related to barrier instructions? Why are these memory sections implemented?
Thanks in advance,
Googling "section .rwbarrier" will get you to https://lists.rtems.org/pipermail/users/2015-May/028893.html, which says:
This section helps to protect the code and read-only sections from write access via the MMU.
It looks like this is not linked to barrier instructions at all. Could it be a section of memory that is named like this just because it separates a read-write region from a read-only region (vector)?
Barrier instructions are used to enforce ordering in a multiprocessor system; they are never tied to an address. A barrier instruction separates, in terms of visibility to other CPUs or threads:
Load and store instructions before the barrier
Load and store instructions after the barrier
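To make the before/after split concrete, here is a minimal userspace sketch using C11 fences. It is an illustration only, not RTEMS or ARM-specific code; the names payload, ready, publish and try_consume are hypothetical. Note that the fences order memory accesses and have nothing to do with any linker section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: a payload plus a "ready" flag. */
static int payload;
static atomic_bool ready;

/* Writer side: the release fence keeps the payload store (before the
 * barrier) ordered ahead of the flag store (after the barrier). */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader side: returns true and fills *out only once the flag has been
 * observed; the acquire fence keeps the payload load after the flag load. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

In a real program publish() and try_consume() would run on different threads or cores; the pairing of the release fence with the acquire fence is what guarantees that a reader who sees the flag also sees the payload.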

Does __threadfence imply the effect of __syncthreads?

I'm implementing parallel reduction in CUDA.
The kernel uses __syncthreads() to wait for all threads to complete two reads from shared memory; each thread then writes the sum back to shared memory.
Should I use __threadfence_block() to ensure that writes to shared memory are visible to all threads for the next iteration, or use __syncthreads() as given in NVIDIA's example?
__syncthreads() implies a memory fence function as well. This is covered in the documentation:
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
So in this case it is not necessary to use __threadfence_block() in addition to __syncthreads().
You cannot substitute a threadfence function for the execution barrier in the usual parallel reduction. The execution barrier (__syncthreads()) is required in addition to the memory fencing function. In general, all threads must finish a given round of the reduction before any thread proceeds to the next round; __threadfence_block() by itself will not force warps to wait while other warps are still executing that round.
Therefore __syncthreads() is generally required, and assuming you have used it properly, __threadfence_block() is generally not required.
__syncthreads() implies __threadfence_block().
__threadfence_block() does not imply __syncthreads().

Why do we need Interrupt context?

I have a doubt: why exactly do we need interrupt context? Everything explains what its properties are, but nothing explains why this concept was introduced.
A related doubt: if we are not disabling interrupts in the interrupt handler, then what is the use of running the handler code in interrupt context?
The interrupt context is fundamentally different from the process context:
It is not associated with a process; a specific process does not serve interrupts, the kernel does. Even though some process will be interrupted, that process has no influence over any parameters of the interrupt itself or the routine that will serve it. It follows that, at the very least, interrupt context must be conceptually different from process context.
Additionally, if an interrupt were to be serviced in a process context, and (re-) scheduled some work at a later time, what context would that run in? The original process may not even exist at that later time. Ergo, we need some context which is independent from processes for a practical reason.
Interrupt handling must be fast; your interrupt handler has interrupted (d'oh) some other code. Significant work should be pushed outside the interrupt handler, onto the "bottom half". It is unacceptable to block a process for work which is not even remotely its concern, either in user or in kernel space.
Running your handler with interrupts disabled is something you can (actually could, before 2.6.36) request when registering your ISR. Recall that a handler can serve interrupts on multiple CPUs simultaneously, and can thus race with itself. Non-Maskable Interrupts (NMIs) cannot be disabled.
Why do we need Interrupt context?
First, what do we mean by interrupt context? A context is usually a state. There are two separate concepts of state.
CPU context
Every CPU architecture has a mechanism for handling interrupts. There may be a single interrupt vector called for every system interrupt, or the CPU/hardware may be capable of dispatching the CPU to a particular address based on the interrupt source. There are also mechanisms for masking/unmasking interrupts. Each interrupt may be masked individually, or there may be a global mask for the entire CPU(s). Finally, there is an actual CPU state. Some CPUs may have separate stacks, register sets, and CPU modes implying some memory and other privileges. Your question is about Linux in general, and it must handle all cases.
Linux context
Generally, all of the architectures have a separate kernel stack, process context (à la ps) and VM (virtual memory) context for each process. The VM has different privileges for user and kernel modes. In order for the kernel to run all the time, it must remain mapped for all processes on a device. A kernel thread is a special case that doesn't care so much about the VM, because it is privileged and can access all kernel memory. However, it does have a separate stack and process context. User registers are typically stored on the kernel stack when exceptions happen. Exceptions are at least page faults, system calls and interrupts. These items may nest. For example, you may call write() from user space, and while the kernel is transferring a user buffer, it may page fault to read some swapped-out user space data. The page fault may in turn have to service an interrupt.
Interrupt recursion
Linux generally wants you to leave interrupts masked, as the VM, the exceptions, and process management (context and context switching) have to work together. To keep things simple for the VM, the kernel stack and process context are generally rooted in a single 4k (or 8k) area, which is a single VM page. This page is always mapped. Typically, all CPUs will switch from interrupt mode to system mode when servicing an interrupt and use the same kernel stack as all other exceptions. The stack is small, so allowing recursion (or large stack allocations) can overflow the kernel stack. This is bad.
Atomicity
Many kernel structures need to stay consistent over multiple bus cycles; e.g., a linked list must update both prev and next node links when adding an element. A typical mechanism to do this may be to mask interrupts, to ensure the code is atomic. Some CPUs may allow bus locking, but this is not universal. The context switching code must also be atomic. A consequence of an interrupt is typically rescheduling. For example, a kernel interrupt handler may have acked a disk controller and started a write operation. Then a kernel thread may be scheduled to write more buffered data from the original user space write().
Interrupts occurring at any time can break some sub-system's assumptions of atomic behavior. Instead of allowing interrupts to use the sub-system, they are prohibited from using it.
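As a rough userspace analogue of the linked-list case, the sketch below guards the multi-step pointer update with a small spin lock built on atomic_flag. The lock merely stands in for "interrupts masked"; the names toy_lock, toy_unlock, list_init and list_add_locked are hypothetical and this is not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *prev, *next;
    int value;
};

/* Hypothetical lock standing in for "interrupts masked". */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;

static void toy_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&list_lock, memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

static void toy_unlock(void)
{
    atomic_flag_clear_explicit(&list_lock, memory_order_release);
}

/* Initialise an empty circular list whose head points at itself. */
void list_init(struct node *head)
{
    assert(head != NULL);
    head->prev = head->next = head;
}

/* Insert n right after head. All four pointer writes happen inside the
 * critical section, so no other lock holder sees a half-linked node. */
void list_add_locked(struct node *head, struct node *n)
{
    assert(head != NULL && n != NULL);
    toy_lock();
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
    toy_unlock();
}
```

Without the lock, an observer running between the third and fourth pointer write would find a list whose forward and backward links disagree, which is exactly the kind of broken assumption described above.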
Summary
Linux must handle three things: the current process execution context, the current virtual memory layout, and hardware requests. They all need to work together. As interrupts may happen at any time, they occur in any process context. Using sleep(), etc. in an interrupt would put random processes to sleep. Allowing large stack allocations in an interrupt could blow up the limited stack. These design choices limit what can happen in a Linux interrupt handler. Various configuration options can allow re-entrant interrupts, but this is often CPU-specific.
A benefit of keeping the top half (now the main interrupt handler) small is that interrupt latency is reduced. Busy work should be done in a kernel thread. An interrupt service routine that would need to un-mask interrupts is already somewhat anti-social to the Linux eco-system. That work should be put in a kernel thread.
The Linux interrupt context really doesn't exist in some sense. It is only a CPU interrupt which may happen in any process context. The Linux interrupt context is actually a set of coding limitations that happen as a consequence of this.

Difference between SoftIRQs and Tasklets

While studying Linux interrupt handling I found that Tasklets and SoftIRQs are two different methods of performing "bottom half" (lesser priority work). I understand this (quite genuine need).
The difference is that SoftIRQs are re-entrant while a Tasklet is not: the same SoftIRQ can run on different CPUs at the same time, while this is not the case with Tasklets.
Though I understand this on the surface, I fail to understand the requirements behind the two features. In what cases would we use these facilities? How do I recognize that I should use Tasklets in one situation and SoftIRQs in another?
Also, what do we mean by saying that Tasklets are built upon SoftIRQs? In one of the books I read that there were debates on LKML about removing Tasklets. I got completely confused about why one would bring in such a feature. Some shortsightedness (no offense meant)?
Any pointers on this will help a lot.
include/linux/interrupt.h
/* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
   frequency threaded job scheduling. For almost all the purposes
   tasklets are more than enough. F.e. all serial device BHs et
   al. should be converted to tasklets, not to softirqs.
 */
enum
{
    HI_SOFTIRQ=0, /* High Priority */
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */
    NR_SOFTIRQS
};
The key differences between softirq and tasklet are:
Allocation
Softirqs are statically allocated at compile-time. Unlike tasklets, you cannot dynamically register and destroy softirqs.
Tasklets can be statically allocated using DECLARE_TASKLET(name, func, data), or allocated dynamically and initialized at runtime using tasklet_init(name, func, data).
Concurrency
Softirqs can run concurrently on several CPUs, even if they are of the same type, because softirqs are reentrant functions; they must therefore explicitly protect their data structures with spinlocks.
Tasklets are non-reentrant and tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time. However, tasklets of different types can be executed concurrently on several CPUs.
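The serialization rule can be modelled in userspace with a single "running" bit per tasklet that is claimed with an atomic test-and-set. This is only a sketch of the idea (the kernel, as far as I understand it, keeps a comparable run-state bit per tasklet on SMP builds); struct toy_tasklet, toy_tasklet_init, toy_tasklet_try_run and the demo handler are all hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a tasklet: a handler plus a "running" bit. */
struct toy_tasklet {
    atomic_flag running;              /* set while some CPU executes func */
    void (*func)(unsigned long);
    unsigned long data;
};

void toy_tasklet_init(struct toy_tasklet *t, void (*func)(unsigned long),
                      unsigned long data)
{
    assert(t != NULL && func != NULL);
    atomic_flag_clear(&t->running);
    t->func = func;
    t->data = data;
}

/* Returns true if this caller ran the handler, false if another caller
 * already holds the running bit for the same tasklet. */
bool toy_tasklet_try_run(struct toy_tasklet *t)
{
    assert(t != NULL && t->func != NULL);
    if (atomic_flag_test_and_set_explicit(&t->running, memory_order_acquire))
        return false;
    t->func(t->data);
    atomic_flag_clear_explicit(&t->running, memory_order_release);
    return true;
}

/* Demo handler: while it runs, it plays the role of a second CPU trying
 * to start the same tasklet and records whether that attempt succeeded. */
static int demo_runs;
static bool demo_nested_ok;

static void demo_handler(unsigned long data)
{
    demo_runs++;
    demo_nested_ok = toy_tasklet_try_run((struct toy_tasklet *)data);
}
```

Two different toy_tasklet objects have independent running bits, so nothing stops them from executing at the same time on different CPUs, which matches the rule stated above.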
Processing
Softirqs are activated by means of raise_softirq(). Pending softirqs are processed by do_softirq() and the ksoftirqd kernel thread once they have been enabled by local_bh_enable() or by spin_unlock_bh().
Tasklets are a bottom-half mechanism built on top of softirqs, i.e. tasklets are represented by two softirqs: HI_SOFTIRQ and TASKLET_SOFTIRQ. Tasklets are actually run from a softirq. The only real difference between these types is that the HI_SOFTIRQ-based tasklets run prior to the TASKLET_SOFTIRQ tasklets. So, tasklet_schedule() basically calls raise_softirq(TASKLET_SOFTIRQ).
Note that softirqs (and hence tasklets and timers) are run on return from hardware interrupts, or on return from a system call. Also, as soon as the thread that raised the softirq ends, that single softirq (and no other) is run to minimize softirq latency.
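The raise-then-process flow, and why HI_SOFTIRQ work runs before TASKLET_SOFTIRQ work, can be pictured with a toy pending bitmask that is scanned from the lowest number upward. This is a userspace model under that assumption, not kernel source; every toy_ name below is hypothetical and only a subset of the softirq numbers is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Toy subset of the softirq numbers, in the same relative order as the
 * enum from include/linux/interrupt.h. */
enum { TOY_HI = 0, TOY_TIMER, TOY_TASKLET, TOY_NR };

static unsigned int toy_pending;            /* one bit per raised softirq */
static void (*toy_action[TOY_NR])(void);    /* handler per softirq number */

void toy_open_softirq(int nr, void (*action)(void))
{
    assert(nr >= 0 && nr < TOY_NR);
    toy_action[nr] = action;
}

void toy_raise_softirq(int nr)
{
    assert(nr >= 0 && nr < TOY_NR);
    toy_pending |= 1u << nr;                /* only marks it pending */
}

/* Runs each pending handler once, lowest number first, and returns how
 * many handlers ran. */
int toy_do_softirq(void)
{
    unsigned int pending = toy_pending;
    int ran = 0;

    toy_pending = 0;
    for (int nr = 0; nr < TOY_NR; nr++) {
        if ((pending & (1u << nr)) && toy_action[nr] != NULL) {
            toy_action[nr]();
            ran++;
        }
    }
    return ran;
}

/* Demo handlers that record the order in which they were run. */
static int toy_log[8];
static int toy_log_len;

static void toy_hi_action(void)
{
    if (toy_log_len < 8)
        toy_log[toy_log_len++] = TOY_HI;
}

static void toy_tasklet_action(void)
{
    if (toy_log_len < 8)
        toy_log[toy_log_len++] = TOY_TASKLET;
}
```

Raising TOY_TASKLET before TOY_HI and then calling toy_do_softirq() still runs the TOY_HI handler first, because processing order follows the softirq number rather than the order of raising.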
Softirqs are re-entrant, that is, different CPUs can take the same softirq and execute it concurrently, while tasklets are serialized, that is, the CPU that is running a tasklet has the right to complete it and no other CPU can take it (in case of scheduling).
Refer to this excellent article.
You can also disable and re-enable deferred processing on the local CPU with local_bh_disable() and local_bh_enable(); local_bh_disable() is what makes __local_bh_count nonzero, and local_bh_enable() decrements it again.
Also read this book (free to download), page 131, which explains the difference and walks through a code example with a fake/dummy device, roller.
Softirqs are statically allocated at compile time. Unlike tasklets, you cannot dynamically register and destroy softirqs. Tasklets work similarly to softirqs; however, they have a simpler interface.
Softirqs are required only for very high-frequency and highly threaded uses, whereas tasklets do just fine in any other case.
Tasklets are implemented on top of softirqs, so they are softirqs. They are represented by two softirqs, HI_SOFTIRQ and TASKLET_SOFTIRQ; the difference is priority.
Even though they are implemented on top of softirqs, they differ in:
Tasklets can be created/destroyed statically or dynamically, but softirqs only statically.
Two different tasklets can run concurrently on different CPUs, but two tasklets of the same type cannot run at the same time. Softirqs are the other way around: the same softirq can run concurrently on several CPUs.
Softirqs are reserved for the most time-critical and important bottom-half processing on the system.

Resources