How are software interrupts triggered in Windows when the IRQL drops?
I know that for hardware interrupts, when KeLowerIrql is called (for instance by KeReleaseInterruptSpinLock), the HAL adjusts the interrupt mask in the LAPIC, which allows queued interrupts (pending in the IRR, probably) to be serviced automatically. But with software interrupts -- for instance, ntdll.dll sysenter calls into the SSDT NtXxx system services -- how are they 'postponed' and triggered when the IRQL drops to passive level? The same goes for the DPC dispatcher software interrupt (if the DPC is for the current CPU and of high priority): how is that triggered when IRQL < DISPATCH_LEVEL? Do the functions the software interrupts invoke (the NtXxx in the SSDT) all loop on a condition, i.e.
while (irql != passive)
Exactly the same question for lazy IRQL:
Because accessing a PIC is a relatively slow operation, HALs that require accessing the I/O bus to change IRQLs, such as for PIC and 32-bit Advanced Configuration and Power Interface (ACPI) systems, implement a performance optimization, called lazy IRQL, that avoids PIC accesses. When the IRQL is raised, the HAL notes the new IRQL internally instead of changing the interrupt mask. If a lower-priority interrupt subsequently occurs, the HAL sets the interrupt mask to the settings appropriate for the first interrupt and does not quiesce the lower-priority interrupt until the IRQL is lowered (thus keeping the interrupt pending). Thus, if no lower-priority interrupts occur while the IRQL is raised, the HAL doesn’t need to modify the PIC.
How does it keep this interrupt pending? Does it just loop on a condition until the higher priority ISR lowers the IRQL and when the thread is scheduled in, the condition will eventually be met? Is it just that simple?
Edit: I must be missing something here, because suppose an ISR at Device IRQL requests a DPC using IoRequestDpc. If it is a high-priority DPC and the target is the current processor, then it schedules an interrupt at DPC/dispatch level to drain the processor's DPC queue. This all happens inside the ISR, which is at Device IRQL (DIRQL), which means the software interrupt at DISPATCH_LEVEL will spin at KeAcquireInterruptSpinLock, I think, because the current IRQL is too high. But wouldn't it be spinning there forever? The actual routine to lower the IRQL is called after the ISR returns, meaning it's going to stay stuck in the ISR at Device IRQL waiting on that software interrupt, which requires IRQL < DISPATCH_LEVEL (2). Not only that, the dispatcher will not be able to dispatch the next thread, because the dispatch DPC runs at DISPATCH_LEVEL, which is far lower. There is one solution I can think of.
1) The ISR returns the KDPC object to KiInterruptDispatch so that it knows what priority the DPC is, and KiInterruptDispatch then schedules it itself after it has lowered the IRQL using KeReleaseInterruptSpinLock. But KSERVICE_ROUTINE only returns an unrelated boolean value, so this is ruled out.
Does anyone know how this situation is avoided?
Edit 2: Perhaps it spawns a new thread that blocks waiting for IRQL < DISPATCH_LEVEL, and then returns from the ISR and drops the IRQL.
This is something that isn't really explained explicitly in any source, and interestingly enough the second comment asks the same question.
Firstly, DPC software interrupts aren't like regular SSDT system-service calls: the latter are not postponed, run at passive IRQL and can be interrupted at any time. DPC software interrupts do not use int or syscall or anything like that; they are postponed and run at dispatch level.
After studying the ReactOS kernel and the WRK, I now know exactly what happens.
A driver, when it receives IRP_MN_START_DEVICE from the PnP manager, initialises an interrupt object using IoConnectInterrupt, using the data in the CM_RESOURCE_LIST it receives in the IRP. Of particular interest are the vector and affinity that the PnP manager assigned to the device (which is simple to do if the device exposes an MSI capability in its PCIe configuration space, as it doesn't have to worry about underlying IRQ routing). The driver passes the vector, a pointer to an ISR, context for the ISR and the IRQL to IoConnectInterrupt, which calls KeInitializeInterrupt to initialise the interrupt object from those parameters, and then calls KeConnectInterrupt, which switches the affinity of the current thread to the target processor, locks the dispatcher database and checks whether the IDT entry points to a BugCheck wrapper, KxUnexpectedInterrupt0[IdtIndex]. If it does, it raises the IRQL to 31 so that the following is an atomic operation, uses the HAL API to enable the vector that the PnP manager mapped on the LAPIC, and assigns it a TPR priority level corresponding to the IRQL. It then maps the vector to the handler address in the IDT entry for the vector. To do this it passes the address &Interrupt->DispatchCode[0] into the IDT mapping routine KeSetIdtHandlerAddress. This appears to be a template which is the same for all interrupt objects and which, according to the WRK, is KiInterruptTemplate. Sure enough, checking the ReactOS kernel, we see in KeInitializeInterrupt -- which is called by IoConnectInterrupt -- the code:
RtlCopyMemory(Interrupt->DispatchCode,
              KiInterruptDispatchTemplate,
              sizeof(Interrupt->DispatchCode));
KiInterruptDispatchTemplate appears to be blank for now, because ReactOS's amd64 port is in early development. On Windows it is implemented, as KiInterruptTemplate.
It then lowers the IRQL back to the old IRQL. If the IDT entry did not point to a BugCheck ISR, it initialises a chained interrupt instead -- because there was already a handler address at the IDT entry. It uses CONTAINING_RECORD to recover the existing interrupt object from the address of its handler member (DispatchCode[0]) and connects the new interrupt object to the one already present, initialising the existing interrupt object's LIST_ENTRY as the head of the list and marking it as a chained interrupt by setting the DispatchAddress member to the address of KiChainedDispatch. It then drops the dispatcher database spinlock, switches the affinity back and returns the interrupt object.
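That CONTAINING_RECORD step is small enough to sketch (assuming ReactOS's KINTERRUPT layout; HandlerAddress stands for the address taken from the IDT entry):

/* Recover the owning interrupt object from the handler address installed
   in the IDT; DispatchCode is the embedded copy of the dispatch template. */
PKINTERRUPT Interrupt = CONTAINING_RECORD(HandlerAddress,
                                          KINTERRUPT,
                                          DispatchCode);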
The driver then sets up a DPC -- with the DeferredRoutine as a member -- for the Device Object using IoInitializeDpcRequest.
FORCEINLINE VOID
IoInitializeDpcRequest(_In_ PDEVICE_OBJECT DeviceObject,
                       _In_ PIO_DPC_ROUTINE DpcRoutine)
{
    KeInitializeDpc(&DeviceObject->Dpc,
                    (PKDEFERRED_ROUTINE)DpcRoutine,
                    DeviceObject);
}
KeInitializeDpc calls KiInitializeDpc, which is hard-coded to set the priority to medium, which means KeInsertQueueDpc will place it in the middle of the DPC queue. KeSetImportanceDpc and KeSetTargetProcessorDpc can be called afterwards to set the initialised DPC's priority and target processor respectively. It copies a DPC object into the member of the device object, and if there is already a DPC object present it queues the new one to it.
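Putting the driver-side setup together, the calls described so far amount to something like this sketch (MyDpcRoutine is an illustrative name; the importance and target calls are the standard WDM ones just mentioned):

/* Initialise DeviceObject->Dpc with the driver's deferred routine */
IoInitializeDpcRequest(DeviceObject, MyDpcRoutine);

/* Optionally override the hard-coded medium importance and pick a target core */
KeSetImportanceDpc(&DeviceObject->Dpc, HighImportance);
KeSetTargetProcessorDpc(&DeviceObject->Dpc, 0); /* target processor 0 */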
When the interrupt happens, the KiInterruptTemplate copy inside the interrupt object is the address in the IDT that gets called; it then calls the real interrupt dispatcher, the DispatchAddress member, which is KiInterruptDispatch for a normal interrupt or KiChainedDispatch for a chained interrupt. It passes the interrupt object to KiInterruptDispatch. It can do this because, as we saw earlier, RtlCopyMemory copied KiInterruptTemplate into the interrupt object, which means the template can use an asm block with RIP-relative addressing to acquire the address of the interrupt object it belongs to (it could also attempt something with CONTAINING_RECORD); intsup.asm contains the following code to do it:

lea rbp, KiInterruptTemplate - InDispatchCode ; get interrupt object address
jmp qword ptr InDispatchAddress[rbp]          ; finish in common code

KiInterruptDispatch will then acquire the interrupt's spinlock, probably using KeAcquireInterruptSpinLock. The ISR (ServiceRoutine) calls IoRequestDpc with the device object address that was created for the device and ISR, along with interrupt-specific context and an optional IRP (which I'm guessing it gets from the head at DeviceObject->Irp if the routine is meant to handle an IRP). I expected IoRequestDpc to be a single-line wrapper of KeInsertQueueDpc that passes the Dpc member of the device object, and that's exactly what it is: KeInsertQueueDpc(&DeviceObject->Dpc, Irp, Context);. First, KeInsertQueueDpc raises the IRQL from the device IRQL of the ISR to 31, which prevents all preemption. The WRK contains the following on line 263 of dpcobj.c:
#if !defined(NT_UP)
    if (Dpc->Number >= MAXIMUM_PROCESSORS) {
        Number = Dpc->Number - MAXIMUM_PROCESSORS;
        TargetPrcb = KiProcessorBlock[Number];
    } else {
        Number = CurrentPrcb->Number;
        TargetPrcb = CurrentPrcb;
    }
This suggests that KeSetTargetProcessorDpc must set the Dpc->Number member to the target core number plus MAXIMUM_PROCESSORS. That seemed bizarre, so I went and looked at ReactOS's KeSetTargetProcessorDpc -- and it does exactly that! KiProcessorBlock appears to be a kernel structure for fast access to the KPRCB structures of each core.
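The encoding is small enough to show; this mirrors what ReactOS does (treat it as a sketch rather than the exact Windows code):

VOID NTAPI KeSetTargetProcessorDpc(IN PKDPC Dpc, IN CCHAR Number)
{
    /* Bias with MAXIMUM_PROCESSORS so KeInsertQueueDpc can tell an
       explicitly targeted DPC apart from an untargeted one */
    Dpc->Number = MAXIMUM_PROCESSORS + Number;
}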
It then gets the core's normal DPC queue using DpcData = KiSelectDpcData(TargetPrcb, Dpc), which returns &Prcb->DpcData[DPC_NORMAL] because the type of the DPC passed to it is normal, not threaded. It then acquires the spinlock for the queue; this appears to be an empty function body on ReactOS, and I think it's because of this:
/* On UP builds, spinlocks don't exist at IRQL >= DISPATCH */
And that makes sense, because ReactOS only supports one core, meaning there is no thread on another core that can access the DPC queue (another core might have a DPC targeted at this core's queue); there is only one DPC queue. On a multicore system it would have to acquire the spinlock, so these look to be placeholders for when multicore functionality is implemented. If it failed to acquire the spinlock for the DPC queue, it would either spin-wait at IRQL 31, or drop to the IRQL of the interrupt itself and spin-wait, allowing other interrupts to reach the core but no other threads to run on it.
Note that Windows would use KeAcquireSpinLockAtDpcLevel to acquire this spinlock; ReactOS does not. KeAcquireSpinLockAtDpcLevel does not touch the IRQL. In the WRK, though, it directly uses KiAcquireSpinLock, as can be seen on line 275 of dpcobj.c; it only acquires the spinlock and does nothing to the IRQL (KiAcquireSpinLock(&DpcData->DpcLock);).
After acquiring the spinlock it first ensures that the DPC object isn't already on a queue (the DpcData member would be null; it does a cmpxchg to initialise it with the DpcData returned from KiSelectDpcData(TargetPrcb, Dpc)), and if it is already queued it drops the spinlock and returns. Otherwise it sets the DPC members to point to the interrupt-specific context that was passed, and then inserts it into the queue either at the head (InsertHeadList(&DpcData->DpcListHead, &Dpc->DpcListEntry);) or the tail (InsertTailList(&DpcData->DpcListHead, &Dpc->DpcListEntry);) based on its priority (if (Dpc->Importance == HighImportance)); a condensed sketch of this path follows below. It then makes sure that a DPC isn't executing already (if (!(Prcb->DpcRoutineActive) && !(Prcb->DpcInterruptRequested))). It then checks whether KiSelectDpcData returned the second KDPC_DATA structure, i.e. whether the DPC is threaded (if (DpcData == &TargetPrcb->DpcData[DPC_THREADED])); if it is, and if ((TargetPrcb->DpcThreadActive == FALSE) && (TargetPrcb->DpcThreadRequested == FALSE)), it does a locked xchg to set TargetPrcb->DpcSetEventRequest to true, then sets TargetPrcb->DpcThreadRequested and TargetPrcb->QuantumEnd to true, and sets RequestInterrupt to true if the target PRCB is the current PRCB; otherwise it only sets it to true if the target core is not idle.
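Here is the condensed sketch of that non-threaded insert path (WRK names, simplified by me; not a verbatim copy):

PVOID Previous;

/* Atomically claim the DPC for this queue; bail out if already queued */
Previous = InterlockedCompareExchangePointer(&Dpc->DpcData, DpcData, NULL);
if (Previous != NULL)
    return FALSE;

/* Stash the interrupt-specific context in the DPC */
Dpc->SystemArgument1 = SystemArgument1;
Dpc->SystemArgument2 = SystemArgument2;
DpcData->DpcQueueDepth += 1;

/* High-importance DPCs jump to the head of the queue */
if (Dpc->Importance == HighImportance) {
    InsertHeadList(&DpcData->DpcListHead, &Dpc->DpcListEntry);
} else {
    InsertTailList(&DpcData->DpcListHead, &Dpc->DpcListEntry);
}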
Now comes the crux of the original question. The WRK now contains the following code:
#if !defined(NT_UP)
    if (CurrentPrcb != TargetPrcb) {
        if (((Dpc->Importance == HighImportance) ||
             (DpcData->DpcQueueDepth >= TargetPrcb->MaximumDpcQueueDepth))) {
            if (((KiIdleSummary & AFFINITY_MASK(Number)) == 0) ||
                (KeIsIdleHaltSet(TargetPrcb, Number) != FALSE)) {
                TargetPrcb->DpcInterruptRequested = TRUE;
                RequestInterrupt = TRUE;
            }
        }
    } else {
        if ((Dpc->Importance != LowImportance) ||
            (DpcData->DpcQueueDepth >= TargetPrcb->MaximumDpcQueueDepth) ||
            (TargetPrcb->DpcRequestRate < TargetPrcb->MinimumDpcRate)) {
            TargetPrcb->DpcInterruptRequested = TRUE;
            RequestInterrupt = TRUE;
        }
    }
#endif
In essence, on a multiprocessor system, if the target core taken from the DPC object is not the current core of the thread: if the DPC is of high importance or the queue depth exceeds the maximum, and either the logical AND of the target's affinity mask with the idle-core summary is 0 (i.e. the target core is not idle) or KeIsIdleHaltSet is true (it checks the Sleeping flag in the target PRCB, i.e. the core is idle but halted and must be woken), then it sets the DpcInterruptRequested flag in the PRCB of the target core. If the target of the DPC is the current core, then if the DPC is not of low importance (note: this admits medium!), or the DPC queue depth exceeds the maximum, or the request rate of DPCs on the core hasn't reached the minimum, it sets the flag in the PRCB of the current core to indicate there is a DPC pending.
It now releases the DPC queue spinlock -- KiReleaseSpinLock(&DpcData->DpcLock); (#if !defined(NT_UP), of course) -- which doesn't alter the IRQL. It then checks whether an interrupt was requested by the procedure (if (RequestInterrupt == TRUE)); if it is a uniprocessor system (#if defined(NT_UP)) it simply calls KiRequestSoftwareInterrupt(DISPATCH_LEVEL);, but on a multicore system it needs to check the target PRCB to see whether it must send an IPI.
    if (TargetPrcb != CurrentPrcb) {
        KiSendSoftwareInterrupt(AFFINITY_MASK(Number), DISPATCH_LEVEL);
    } else {
        KiRequestSoftwareInterrupt(DISPATCH_LEVEL);
    }
And what that does speaks for itself: if the current PRCB is not the target PRCB of the DPC, it sends an IPI at DISPATCH_LEVEL priority to the target processor using KiSendSoftwareInterrupt; otherwise it uses KiRequestSoftwareInterrupt. There is no documentation at all, but my guess is that this is a self-IPI: it will wrap a HAL function that programs the ICR to send an IPI to the current core at dispatch-level priority (my reasoning being that ReactOS at this stage calls HalRequestSoftwareInterrupt, which shows an unimplemented PIC write). So it's not a software interrupt in the INT sense; put simply, it is actually a hardware interrupt. It then lowers the IRQL back from 31 to the previous IRQL (which was the ISR IRQL). It then returns to the ISR, which returns to KiInterruptDispatch; KiInterruptDispatch will then release the ISR spinlock using KeReleaseInterruptSpinLock, which reduces the IRQL to what it was before the interrupt, and it then pops the trap frame. I would have thought it would first pop the trap frame and then program the LAPIC TPR, so that the register-restore process is atomic, but I suppose it doesn't really matter.
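If that guess is right, the HAL side could be as small as this (a sketch only: ApicIcrLow and HalpRequestSelfIpi are names I've invented, though the ICR bit layout itself is architectural):

/* ApicIcrLow stands for the kernel VA mapping of the xAPIC ICR low dword
   (physical 0xFEE00300) */
extern volatile ULONG *ApicIcrLow;

VOID HalpRequestSelfIpi(UCHAR Vector)
{
    /* Bits 19:18 = 01b select the "self" destination shorthand; the zeroed
       delivery-mode bits (10:8) mean fixed delivery of the given vector */
    *ApicIcrLow = (1UL << 18) | Vector;
}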
ReactOS has the following (the WRK doesn't publish KeReleaseSpinLock or the IRQL-lowering procedures, so this is the best we have):
VOID NTAPI KeReleaseSpinLock(PKSPIN_LOCK SpinLock, KIRQL OldIrql)
{
    /* Release the lock and lower IRQL back */
    KxReleaseSpinLock(SpinLock);
    KeLowerIrql(OldIrql);
}
VOID FASTCALL KfReleaseSpinLock(PKSPIN_LOCK SpinLock, KIRQL OldIrql)
{
    /* Simply lower IRQL back */
    KeLowerIrql(OldIrql);
}
KeLowerIrql is a wrapper for the HAL function KfLowerIrql; its body contains KfLowerIrql(OldIrql); and that's it.
VOID FASTCALL KfLowerIrql(KIRQL NewIrql)
{
    DPRINT("KfLowerIrql(NewIrql %d)\n", NewIrql);

    if (NewIrql > KeGetPcr()->Irql)
    {
        DbgPrint("(%s:%d) NewIrql %x CurrentIrql %x\n",
                 __FILE__, __LINE__, NewIrql, KeGetPcr()->Irql);
        KeBugCheck(IRQL_NOT_LESS_OR_EQUAL);
        for (;;);
    }

    HalpLowerIrql(NewIrql);
}
This function basically prevents the new IRQL from being higher than the current IRQL, which makes sense because the function is supposed to lower the IRQL. If everything is OK, the function calls HalpLowerIrql(NewIrql);. This is a skeleton of a multiprocessor AMD64 implementation -- it does not actually implement the APIC register writes (or MSR writes for x2APIC); they are empty functions in ReactOS's multiprocessor AMD64 implementation, as it is in development. On Windows they won't be empty: they'll actually program the LAPIC TPR so that the queued software interrupt can now occur.
VOID HalpLowerIrql(KIRQL NewIrql, BOOLEAN FromHalEndSystemInterrupt)
{
    ULONG Flags;
    UCHAR DpcRequested;

    if (NewIrql >= DISPATCH_LEVEL)
    {
        KeSetCurrentIrql(NewIrql);
        APICWrite(APIC_TPR, IRQL2TPR(NewIrql) & APIC_TPR_PRI);
        return;
    }

    Flags = __readeflags();
    if (KeGetCurrentIrql() > APC_LEVEL)
    {
        KeSetCurrentIrql(DISPATCH_LEVEL);
        APICWrite(APIC_TPR, IRQL2TPR(DISPATCH_LEVEL) & APIC_TPR_PRI);
        DpcRequested = __readfsbyte(FIELD_OFFSET(KIPCR, HalReserved[HAL_DPC_REQUEST]));
        if (FromHalEndSystemInterrupt || DpcRequested)
        {
            __writefsbyte(FIELD_OFFSET(KIPCR, HalReserved[HAL_DPC_REQUEST]), 0);
            _enable();
            KiDispatchInterrupt();
            if (!(Flags & EFLAGS_INTERRUPT_MASK))
            {
                _disable();
            }
        }
        KeSetCurrentIrql(APC_LEVEL);
    }

    if (NewIrql == APC_LEVEL)
    {
        return;
    }

    if (KeGetCurrentThread() != NULL &&
        KeGetCurrentThread()->ApcState.KernelApcPending)
    {
        _enable();
        KiDeliverApc(KernelMode, NULL, NULL);
        if (!(Flags & EFLAGS_INTERRUPT_MASK))
        {
            _disable();
        }
    }

    KeSetCurrentIrql(PASSIVE_LEVEL);
}
Firstly, it checks whether the new IRQL is at or above dispatch level; if so, it sets the IRQL just fine, writes to the LAPIC TPR register and returns. If not, it checks whether the current IRQL is dispatch level (> APC_LEVEL) -- which means, by definition, that the new IRQL is going to be less than dispatch level. In that event it sets the IRQL to DISPATCH_LEVEL rather than letting it drop below, and writes that to the LAPIC TPR register. It then checks HalReserved[HAL_DPC_REQUEST], which appears to be what ReactOS uses instead of the DpcInterruptRequested flag we saw previously, so just substitute one for the other. It then sets it to 0 (note the PCR begins at the start of the segment pointed to by the FS segment register in kernel mode). It then enables interrupts, calls KiDispatchInterrupt, and afterwards, if the IF flag was clear in the saved EFLAGS, disables interrupts again. It then also checks whether a kernel APC is pending (which is beyond the scope of this explanation) before finally setting the IRQL to passive level.
VOID NTAPI KiDispatchInterrupt(VOID)
{
    PKIPCR Pcr = (PKIPCR)KeGetPcr();
    PKPRCB Prcb = &Pcr->Prcb;
    PKTHREAD NewThread, OldThread;

    /* Disable interrupts */
    _disable();

    /* Check for pending timers, pending DPCs, or pending ready threads */
    if ((Prcb->DpcData[0].DpcQueueDepth) ||
        (Prcb->TimerRequest) ||
        (Prcb->DeferredReadyListHead.Next))
    {
        /* Retire DPCs while under the DPC stack */
        //KiRetireDpcListInDpcStack(Prcb, Prcb->DpcStack);
        // FIXME!!! //
        KiRetireDpcList(Prcb);
    }

    /* Re-enable interrupts */
    _enable();

    /* Check for quantum end */
    if (Prcb->QuantumEnd)
    {
        /* Handle quantum end */
        Prcb->QuantumEnd = FALSE;
        KiQuantumEnd();
    }
    else if (Prcb->NextThread)
    {
        /* Capture current thread data */
        OldThread = Prcb->CurrentThread;
        NewThread = Prcb->NextThread;

        /* Set new thread data */
        Prcb->NextThread = NULL;
        Prcb->CurrentThread = NewThread;

        /* The thread is now running */
        NewThread->State = Running;
        OldThread->WaitReason = WrDispatchInt;

        /* Make the old thread ready */
        KxQueueReadyThread(OldThread, Prcb);

        /* Swap to the new thread */
        KiSwapContext(APC_LEVEL, OldThread);
    }
}
Firstly, it disables interrupts. _disable is just a wrapper around an asm block that clears the interrupt flag, with memory and cc in the clobber list (to prevent compiler reordering). The snippet below is from ReactOS's ARM port, hence the cpsid i instruction.
{
    __asm__ __volatile__
    (
        "cpsid i # __cli" : : : "memory", "cc"
    );
}
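For comparison, the x86 flavour of the same wrapper would presumably just execute cli (my assumption, shown in the same GCC inline-asm style):

{
    __asm__ __volatile__
    (
        "cli" : : : "memory"
    );
}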
This ensures that it can drain the DPC queue as an uninterrupted procedure: with interrupts disabled, it cannot be interrupted by a clock interrupt and rescheduled. This prevents, for instance, the scenario of two schedulers running at the same time: if a thread yields with Sleep(), it ends up calling KeRaiseIrqlToSynchLevel, which is analogous to disabling interrupts. This prevents a timer interrupt from interrupting the currently executing thread-switch procedure and scheduling another thread switch over the top of it -- it ensures that scheduling is atomic.
It checks whether there are DPCs on the normal queue of the current core, or a timer expiry, or deferred ready threads, and then calls KiRetireDpcList, which basically contains a while (queue depth != 0) loop. The loop first checks whether there is a timer expiry request (which I won't go into now); if not, it acquires the DPC queue spinlock, takes a DPC off the queue and parses its members into arguments (interrupts still disabled), decreases the queue depth, drops the spinlock, enables interrupts and calls the DeferredRoutine. When the DeferredRoutine returns, it disables interrupts again, and if there are more DPCs in the queue it reacquires the spinlock (the spinlock plus disabled interrupts ensure that removal from the queue is atomic, so that another interrupt, and hence another queue drain, does not work on the same DPC: it will already have been removed from the queue). Since the DPC queue spinlock is not implemented yet on ReactOS, we can postulate what might happen on Windows: if it fails to acquire the spinlock, then given that it's a spinlock, that we are still at DISPATCH_LEVEL and that interrupts are disabled, it would spin until the thread on the other core calls KeReleaseSpinLockFromDpcLevel(&DpcData->DpcLock);. That is not much of a holdup -- each holder owns the spinlock for perhaps a hundred instructions -- so we can afford to have interrupts disabled at DISPATCH_LEVEL.
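Condensed, the loop described above looks something like this (a simplified sketch, not verbatim WRK code):

PLIST_ENTRY ListEntry;
PKDPC Dpc;

while (DpcData->DpcQueueDepth != 0)
{
    /* Pop one DPC with the queue locked and interrupts disabled */
    KiAcquireSpinLock(&DpcData->DpcLock);
    ListEntry = RemoveHeadList(&DpcData->DpcListHead);
    Dpc = CONTAINING_RECORD(ListEntry, KDPC, DpcListEntry);
    Dpc->DpcData = NULL;          /* allow it to be queued again */
    DpcData->DpcQueueDepth -= 1;
    KiReleaseSpinLock(&DpcData->DpcLock);

    /* Run the deferred routine with interrupts enabled */
    _enable();
    Dpc->DeferredRoutine(Dpc, Dpc->DeferredContext,
                         Dpc->SystemArgument1, Dpc->SystemArgument2);
    _disable();
}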
Note that the drain procedure only ever drains the queue of the current core. When the DPC queue is empty, it re-enables interrupts, checks whether there are any deferred ready threads and makes them all ready. It then returns down the call chain to KiInterruptTemplate, and the ISR officially ends.
So, as an overview: in KeInsertQueueDpc, if the DPC is targeted at another core, and it is of high priority or the queue depth exceeds the maximum defined in the PRCB, it sets the DpcInterruptRequested flag in the PRCB of that core and sends an IPI to it, which most likely runs KiDispatchInterrupt in some way (the ISR could just be the IRQL-lowering procedure, which indeed calls KiDispatchInterrupt); that drains the DPC queue on that core. The actual wrapper that calls KiDispatchInterrupt may or may not clear the DpcInterruptRequested flag in the PRCB the way HalpLowerIrql does -- I don't know; it may indeed be HalpLowerIrql, as I suggested. After KeInsertQueueDpc, when the IRQL is lowered, nothing happens, because the DpcInterruptRequested flag is set in the other core's PRCB, not the current core's. If the DPC is targeted at the current core, then if it is of high or medium priority, or the queue depth has exceeded the maximum and the DPC rate is less than the minimum defined in the PRCB, it sets the DpcInterruptRequested flag in the current PRCB and requests a self-IPI, which will invoke the same generic wrapper that the scheduler also uses -- so probably something like HalpLowerIrql. After KeInsertQueueDpc, it lowers the IRQL with HalpLowerIrql, sees DpcInterruptRequested, and so drains the queue of the current core before lowering the IRQL.
Do you see the problem with this, though? The WRK shows a 'software' interrupt being requested (whose ISR probably calls KiDispatchInterrupt, as it is a multi-purpose function, and there is only one function that is ever used -- KiRequestSoftwareInterrupt(DISPATCH_LEVEL) -- in all scenarios), but then ReactOS shows KiDispatchInterrupt also being called when the IRQL drops. You'd expect that when KiInterruptDispatch drops the ISR spinlock, the routine that does so would just check for deferred ready threads or a timer expiry request and then simply drop the IRQL, because the software interrupt to drain the queue will fire as soon as the LAPIC TPR is programmed; but ReactOS actually checks for items on the queue (using the flag in the PRCB) and initiates the queue drain inside the procedure that lowers the IRQL. There is no WRK source code for the spinlock release, so let's assume it just doesn't do what ReactOS does and lets the 'software' interrupt handle it -- perhaps it leaves that whole DPC-queue check out of its equivalent of HalpLowerIrql. But wait a second: what is Prcb->DpcInterruptRequested for, then, if it's not used for initiating the queue drain as on ReactOS? Perhaps it is merely a control variable so that two software interrupts aren't queued. Yet we also note that ReactOS requests a 'software' interrupt at this stage as well (to ARM's Vectored Interrupt Controller), which is extremely odd. So maybe not. This strongly suggests that it gets called twice: it drains the queue, and then the 'software' interrupt comes in immediately afterwards when the IRQL drops (which most likely also calls KiRetireDpcList at some stage), both on ReactOS and in the WRK, and does the same thing. I wonder what anyone makes of that. I mean, why both self-IPI and then drain the queue anyway? One of these actions is redundant.
As for lazy IRQL: I see no evidence of it in the WRK or ReactOS, but the place it would be implemented is KiInterruptDispatch. It would be possible to get the current IRQL using KeGetCurrentIrql, compare it with the IRQL of the interrupt object, and then program the TPR to correspond to the current IRQL. It would either quiesce the interrupt and queue another for that vector using a self-IPI, or simply switch trap frames.
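Pure speculation, but such a check at the top of KiInterruptDispatch might look roughly like this (every name here apart from KeGetCurrentIrql and the interrupt object members is assumed; HalpRequestSelfIpi is the hypothetical helper sketched earlier):

if (Interrupt->Irql <= KeGetCurrentIrql())
{
    /* The hardware mask was stale: sync the TPR to the software IRQL and
       keep the interrupt pending by re-posting its vector to ourselves */
    HalpSyncHardwarePriority(KeGetCurrentIrql()); /* assumed helper */
    HalpRequestSelfIpi(Interrupt->Vector);        /* assumed helper */
    return;                                       /* dismiss for now */
}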
Related
What request_irq() does internally?
As I know, it "allocates an interrupt line", but what is happening after request_irq()? How does a particular handler get called on receiving an interrupt? Can anybody explain it with a code snippet?
What is happening after request_irq()?

A device driver registers an interrupt handler and enables a given interrupt line for handling by calling request_irq(). The call flow is: request_irq() -> setup_irq() to register the struct irqaction; setup_irq() -> start_irq_thread() to create a kernel thread to service the interrupt line. The thread's work is implemented in do_irqd(). Only one thread can be created per interrupt line, and shared interrupts are still handled by a single thread. The ISR (interrupt handler) passed to request_irq() is handed to start_irq_thread(), which creates a kernel thread that calls your ISR.

How does a particular handler get called on receiving an interrupt?

When an interrupt occurs, the PIC gives the interrupt information to the CPU: a device sends the PIC chip an interrupt, and the PIC tells the CPU an interrupt occurred (either directly or indirectly). When the CPU acknowledges the "interrupt occurred" signal, the PIC chip sends the interrupt number (between 00h and FFh, or 0 and 255 decimal) to the CPU. This interrupt number is used as an index into the interrupt vector table. A processor typically maps each interrupt type to a corresponding pointer in low memory. The collection of pointers for all the interrupt types is an interrupt vector. Each pointer in the vector points to the ISR for the corresponding interrupt type (IRQ line). "An interrupt vector is only ONE memory address of one interrupt handler. An interrupt vector table is a group of several memory addresses." For further reading: http://wiki.osdev.org/Interrupts
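For completeness, a minimal usage sketch of the API being discussed (standard Linux kernel calls; the handler body and device names are illustrative):

#include <linux/interrupt.h>

/* Handler invoked for each interrupt on the registered line */
static irqreturn_t my_isr(int irq, void *dev_id)
{
    /* read/clear the device's interrupt status here */
    return IRQ_HANDLED;
}

/* In probe/init code: claim the line, name it "mydev", allow sharing */
ret = request_irq(irq, my_isr, IRQF_SHARED, "mydev", my_dev);
if (ret)
    pr_err("request_irq failed: %d\n", ret);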
Intel 8259 PIC - Acknowledge interrupt
Assume we have a system with a CPU fully compatible with the Intel 8259 Programmable Interrupt Controller, so the CPU uses vectored interrupts, of course. When one of the eight interrupts occurs, the PIC just asserts the INTR wire connected to the CPU. Now the PIC waits for the CPU to assert INTA. When it does, the PIC selects the pending interrupt with the highest priority (depending on pin number) and sends its interrupt vector to the data bus. I've omitted some timing, but that doesn't matter for now, I think. Here are the questions: How does the device that caused the interrupt know that its interrupt request was accepted, so that it can deassert the request? I read about the 8259, but I didn't find it. Is acknowledging the device whose interrupt was accepted performed in the ISR?
The best reference is the original Intel doc, available here: https://pdos.csail.mit.edu/6.828/2012/readings/hardware/8259A.pdf It has full details of these modes, how the device operates, and how to program the device. Caveat: I'm a bit rusty as I haven't programmed the 8259 in many years, but I'll take a shot at explaining things, per your request.

After an interrupting device, connected to an IRR ["interrupt request register"] pin, has asserted an interrupt request, the 8259 will convey this to the CPU by asserting INTR and then placing the vector on the bus during the INTA cycles generated by the CPU.

After a given device has asserted IRR, the 8259's IS ["in-service"] register is or'ed with a mask of the IRR pin number. The IS is a priority select. While the IS bit is set, other interrupting devices of lower priority [or the original one] will not cause an INTR/INTA cycle to the CPU. The IS bit must be cleared first. These interrupts remain "pending".

The IS can be cleared by an EOI (end-of-interrupt) operation. There are multiple EOI modes that can be programmed. The EOI can be generated by the 8259 in AEOI mode. In other modes, the EOI is generated manually by the ISR by sending a command to the 8259. The EOI action is all about allowing other devices to cause interrupts while the ISR is processing the current one. The EOI does not clear the interrupting device.

Clearing the interrupting device must be done by the ISR using whatever device-specific register the device has for that purpose. Usually, this is a "pending interrupt" register [can be 1 bit wide]. Most H/W uses two interrupt-related registers, and the other one is an "interrupt enable" register.

With level-triggered interrupts, if the ISR does not clear the device, then when the ISR does issue the EOI command to the 8259, the 8259 will [try to] reinterrupt the CPU using the vector for the same device for the same condition. The CPU will probably be reinterrupted as soon as it issues an sti or iret instruction. Thus, an ISR routine must take care to process things in proper sequence.

Consider an example. We have a video controller that has four sources for interrupts:

HSTART -- start of horizontal line
HEND -- end of horizontal line [start of horizontal blanking interval]
VSTART -- start of new video field/frame
VEND -- end of video field/frame [start of vertical blanking interval]

The controller presents these as a bit mask in its own special interrupt source register, which we'll call vidintr_pend. We'll call the interrupt enable register vidintr_enable. The video controller will use only one 8259 IRR pin. It is the responsibility of the CPU's video ISR to interrogate the vidintr_pend register and decide what to do. The video controller will assert its IRR pin as long as vidintr_pend is non-zero. Since we're level triggered, the CPU may be re-interrupted.

Here is a sample ISR routine to go with this:

// video_init -- initialize controller
void video_init(void)
{
    write_port(...);
    write_port(...);
    write_port(...);
    ...
    // we only care about the vertical interrupts, not the horizontal ones
    write_port(vidintr_enable, VSTART | VEND);
}

// video_stop -- stop controller
void video_stop(void)
{
    // stop all interrupt sources
    write_port(vidintr_enable, 0);
    write_port(...);
    write_port(...);
    write_port(...);
    ...
}

// vidisr_process -- process video interrupts
void vidisr_process(void)
{
    u32 pendmsk;

    // NOTE: we loop because controller may assert a new, different interrupt
    // while we're processing a given one -- we don't want to exit if we _know_
    // we'll be [almost] immediately re-entered
    while (1) {
        pendmsk = port_read(vidintr_pend);
        if (pendmsk == 0)
            break;

        // the normal way to clear on most H/W is a writeback
        // writing a 1 to a given bit clears the interrupt source
        // writing a 0 does nothing
        // NOTE: with this method, we can _never_ have a race condition where
        // we lose an interrupt
        port_write(vidintr_pend, pendmsk);

        if (pendmsk & HSTART) ...
        if (pendmsk & HEND) ...
        if (pendmsk & VSTART) ...
        if (pendmsk & VEND) ...
    }
}

// vidisr_simple -- simple video ISR routine
void vidisr_simple(void)
{
    // NOTE: interrupt state has been pre-saved for us ...

    // process our interrupt sources
    vidisr_process();

    // allow other devices to cause interrupts
    port_write(8259, SEND_NON_SPECIFIC_EOI);

    // return from interrupt by popping interrupt state
    iret();
}

// vidisr_nested -- video ISR routine that allows nested interrupts
void vidisr_nested(void)
{
    // NOTE: interrupt state has been pre-saved for us ...

    // allow other devices to cause interrupts
    port_write(8259, SEND_NON_SPECIFIC_EOI);

    // allow us to receive them
    sti();

    // process our interrupt sources
    // this can be interrupted by another source or another device
    vidisr_process();

    // return from interrupt by popping interrupt state
    iret();
}

UPDATE: Your followup questions:

(1) Why do you use interrupt disable on the video controller register instead of masking the 8259's interrupt enable bit?
(2) When you execute the vidisr_nested(void) function, it will enable nesting the same interrupt. Is that true? And is that what you want?

To answer (1), we should do both, but not necessarily in the same place. They seem similar, but they work in slightly different ways.

We change the video controller registers in the video controller driver [as it's the only place that "understands" the video controller's registers]. The video controller actually asserts the 8259's IRR pin from: IRR = ((vidintr_enable & vidintr_pend) != 0). If we never set vidintr_enable (i.e. it's all zeroes), then we can operate the device in a "polled" [non-interrupt] mode.

The 8259 interrupt enable register works similarly, but it masks against which IRRs [asserted or not] may interrupt the CPU. The device's vidintr_enable controls whether it will assert IRR or not.

In the example video driver, the init routine enables the vertical interrupts, but not the horizontal. Only the vertical interrupts will generate a call to the ISR, but the ISR can/will also process the horizontal ones [as polled bits].

Changing the 8259 interrupt enable mask should be done in a place that understands the interrupt topology of the entire system. This is usually done by the containing OS. That's because the OS knows about the other devices and can make the best choice. Herein, "containing OS" could be a full OS like Linux [with which I'm most familiar]. Or, it could just be an R/T executive [or boot ROM -- I've written a few] that has some common device handling framework with "helper" functions for the device drivers.

For example, it's usual that all devices get their own IRR pin. But it is possible, with level triggering, for two different devices to share an IRR. (e.g.) IRR[0] = devA_IRROUT | devB_IRROUT. Either through an OR gate [or wired OR(?)]. It's also possible that the device is attached to a "nested" or "cascaded" interrupt controller.
IIRC [consult the document], it is possible to have a "master" 8259 and [up to] 8 "slave" 8259s. Each slave 8259 connects to an IRR pin of the master. Then, connect devices to the slave IRR pins. For a fully loaded system, you can have 64 interrupting devices. And the master can have slave 8259s on some IRR pins and real devices on others [a "hybrid" topology]. Usually, only the OS knows enough to deal with this.

In a real system, a device driver probably wouldn't touch the 8259 at all. The non-specific EOI would probably have been sent to the 8259 before entering the device's ISR. And the OS would handle the full "save state" and "restore state", with the driver just handling device-specific actions. Also, under an OS, the OS will call the "init" and "stop" routines. The general OS routines for this will handle the 8259 and call the device-specific ones.

For example, under Linux [or almost any other OS or R/T executive], the interrupt sequence goes something like this:

- CPU hardware actions [atomic]:
  - push %esp and flags register [has CPU interrupt enable flag] to stack
  - clear CPU interrupt enable flag (e.g. implied cli)
  - jump within interrupt vector table
- OS general ISR (preset within IVT):
  - push all remaining registers to stack
  - send non-specific EOI to 8259(s)
  - call device-specific ISR (NOTE: CPU interrupt flag still clear)
  - pop regs
  - iret

To answer (2), yes, you are correct. It would probably interrupt immediately, and might nest (infinitely :-).

The simple ISR version is more efficient and preferable if the actions taken in the ISR are short, quick, and simple (e.g. just output to a few data ports). If the required actions take a relatively long time (e.g. intensive calculations, or writes to a large number of ports or memory locations), the nested version is preferred to prevent other devices from having entry to their ISRs delayed excessively.

However, some time-critical devices [like a video controller] need to use the simple model, preventing interruption by other devices, to guarantee that they can complete in a finite, deterministic time. For example, the video ISR's handling of VEND might program the device for the next/upcoming field/frame, and it must complete this within the vertical blanking interval. It has to do this even if it means "excessive" delay of other ISRs. Note that the ISR was "racing" to complete before the end of the blanking interval. Not the best design. I've had to program such a controller/device.

For rev 2, we changed the design so the device registers were double-buffered. That meant we could set up the registers for frame 1 anytime during the [much longer] frame 0 display period. At VSTART for frame 1, the video hardware would instantly clock-in/save the double-buffered values, and the CPU could then set up for frame 2 anytime during the display of frame 1. And so on ...

With the modified design, the video driver removed the device setup from the ISR entirely. It was now handled at OS task level.

In the driver example below, I've adjusted the sequencing a bit to prevent infinite stacking, and added some additional information based upon my question (1) answer. That is, it shows [crudely] what to do with or without an OS.
// video controller driver
//
// for illustration purposes, STANDALONE means a very simple software system
//
// if it's _not_ defined, we assume the ISR is called from an OS general ISR
// that handles 8259 interactions
//
// if it's _defined_, we're showing [crudely] what needs to be done
//
// NOTE: although this is largely C code, it's also pseudo-code in places

// video_init -- initialize controller
void video_init(void)
{
    write_port(...);
    write_port(...);
    write_port(...);
    ...

#ifdef STANDALONE
    write_port(8259_interrupt_enable |= VIDEO_IRR_PIN);
#endif

    // we only care about the vertical interrupts, not the horizontal ones
    write_port(vidintr_enable, VSTART | VEND);
}

// video_stop -- stop controller
void video_stop(void)
{
    // stop all interrupt sources
    write_port(vidintr_enable, 0);

#ifdef STANDALONE
    write_port(8259_interrupt_enable &= ~VIDEO_IRR_PIN);
#endif

    write_port(...);
    write_port(...);
    write_port(...);
    ...
}

// vidisr_pendmsk -- get video controller pending mask (and clear it)
u32 vidisr_pendmsk(void)
{
    u32 pendmsk;

    pendmsk = port_read(vidintr_pend);

    // the normal way to clear on most H/W is a writeback
    // writing a 1 to a given bit clears the interrupt source
    // writing a 0 does nothing
    // NOTE: with this method, we can _never_ have a race condition where
    // we lose an interrupt
    port_write(vidintr_pend, pendmsk);

    return pendmsk;
}

// vidisr_process -- process video interrupts
void vidisr_process(u32 pendmsk)
{
    // NOTE: we loop because controller may assert a new, different interrupt
    // while we're processing a given one -- we don't want to exit if we _know_
    // we'll be [almost] immediately re-entered
    while (1) {
        if (pendmsk == 0)
            break;

        if (pendmsk & HSTART) ...
        if (pendmsk & HEND) ...
        if (pendmsk & VSTART) ...
        if (pendmsk & VEND) ...

        pendmsk = port_read(vidintr_pend);
    }
}

// vidisr_simple -- simple video ISR routine
void vidisr_simple(void)
{
    u32 pendmsk;

    // NOTE: interrupt state has been pre-saved for us ...

    pendmsk = vidisr_pendmsk();

    // process our interrupt sources
    vidisr_process(pendmsk);

    // allow other devices to cause interrupts
#ifdef STANDALONE
    port_write(8259, SEND_NON_SPECIFIC_EOI);
#endif

    // return from interrupt by popping interrupt state
#ifdef STANDALONE
    pop_regs();
    iret();
#endif
}

// vidisr_nested -- video ISR routine that allows nested interrupts
void vidisr_nested(void)
{
    u32 pendmsk;

    // NOTE: interrupt state has been pre-saved for us ...

    // get device pending mask -- do this _before_ [optional] EOI and the sti
    // to prevent immediate stacked interrupts
    pendmsk = vidisr_pendmsk();

    // allow other devices to cause interrupts
#ifdef STANDALONE
    port_write(8259, SEND_NON_SPECIFIC_EOI);
#endif

    // allow us to receive them
    // NOTE: with or without OS, we can't stack until _after_ this
    sti();

    // process our interrupt sources
    // this can be interrupted by another source or another device
    vidisr_process(pendmsk);

    // return from interrupt by popping interrupt state
#ifdef STANDALONE
    pop_regs();
    iret();
#endif
}

BTW, I'm the author of the Linux irqtune program. I wrote it back in the mid 90's. It's of lesser use now, and probably doesn't work on modern systems, but the FAQ I wrote has a great deal of information about interrupt device priorities. The program itself did a simple 8259 manipulation. An online copy is available here: http://archive.debian.org/debian/dists/Debian-1.1/main/disks-i386/SpecialKernels/irqtune/README.html There's probably source code somewhere in this archive. That's the version 0.2 doc.
I haven't found an online copy of version 0.6, which has a better explanation, so I've put up a text version here: http://pastebin.com/Ut6nCgL6

Side note: The "where to get" information in the FAQ [and the email address] are no longer valid. And I didn't understand the full impact of "spam" until I posted the FAQ and started getting [tons of] it ;-) And irqtune even drew Linus' ire. Not because it didn't work, but because it did: https://lkml.org/lkml/1996/8/23/19 IMO, if he had read the FAQ, he would have understood why [as what irqtune did is standard stuff to R/T guys].

UPDATE #2: Your new questions:

(3) I think that you are missing a destination address in write_port(8259_interrupt_enable &= ~VIDEO_IRR_PIN). Isn't it so?
(4) Is the IRR register read-only or r/w? If the second case, what is the purpose of writing into it?
(5) Are interrupt vectors stored as logical addresses or physical addresses?

To answer question (3): No, not really [even if it seemed so]. The code snippet was "pseudo code" [not pure C code], as I mentioned in a code comment at the top, so technically speaking, I'm covered. However, to make it more clear, here is what the [closer to] real C code would look like:

// the system must know _which_ IRR H/W pin the video controller is connected to
// so we _hardwire_ it here
#define VIDEO_IRR_PIN_NUMBER 3 // just an example
#define VIDEO_IMR_MASK (1 << VIDEO_IRR_PIN_NUMBER)

// video_enable -- enable/disable video controller in 8259
void video_enable(int enable)
{
    u32 val;

    // NOTE: we're reading/writing the _enable_ register, not the IRR [which
    // software can _not_ modify or read]
    val = read_port(8259_interrupt_enable);

    if (enable)
        val |= VIDEO_IMR_MASK;
    else
        val &= ~VIDEO_IMR_MASK;

    write_port(8259_interrupt_enable, val);
}

Now, in video_init, replace the code inside STANDALONE with video_enable(1), and, in video_stop, with video_enable(0).

As to question (4): We weren't really writing to the IRR, even though the symbol had _IRR_ in it. As mentioned in the code comments above, we were writing to the 8259 interrupt enable register, which is really the "interrupt mask register" or IMR in the documentation. The IMR can be read from and written to using OCW1 (see doc). There is no way for software to access the IRR at all. (i.e.) There is no port in the 8259 to read or write the IRR value. The IRR is completely internal to the 8259.

There is a one-to-one correspondence between IRR pin numbers [0-7] and IMR bit numbers (e.g. to enable for IRR(0), set IMR bit 0), but the software has to know which bit to set. Because the video controller is physically connected to a given IRR pin, it is always the same for a given PC board. The software [on older non-PnP systems] can't probe for this. Even on newer systems, the 8259 knows nothing of PnP, so it's still hardwired. The video controller driver programmer must just "know" which IRR pin is being used [by consulting the "spec sheet" or controller "architecture reference manual"].

To answer question (5): First consider what the 8259 does. When the 8259 is initialized, the ICW2 ("initialization command word 2") gets set by the OS driver. This defines a portion of the interrupt vector number the 8259 will present during the INTR/INTA cycle. In ICW2, the most significant 5 bits are marked T7-T3.
When an interrupt occurs, these bits are combined with the IRR pin number of the interrupting device [which is 3 bits wide] to form an 8-bit interrupt vector number: T7,T6,T5,T4,T3|I2,I1,I0

For example, if we put 0xD0 into ICW2, with our video controller using IRR pin 3, we'd have 1,1,0,1,0|0,1,1 or 0xD3 as the interrupt vector number that the 8259 will send to the CPU.

This is just a vector number [0x00-0xFF], as the 8259 knows nothing of memory addresses. It is the CPU that takes this vector number and, using the CPU's "interrupt vector table" [IVT], uses it as an index into the IVT to properly vector the interrupt to an ISR routine. On 80386 and later architectures, the IVT is actually called an IDT ("interrupt descriptor table"). For details, see the "System Programming Guide", chapter 6: http://download.intel.com/design/processor/manuals/253668.pdf

As to whether the resulting ISR address from the IVT/IDT is physical or logical: that depends on the processor mode (e.g. real mode, protected mode, protected mode with virtual addressing enabled). In a sense, all such addresses are always logical, and all logical addresses undergo a translation to physical on each CPU instruction. Whether the translation is one-to-one [MMU not enabled, or page tables with one-to-one mapping] is a question of "How has the OS set things up?"
Strictly speaking, there is no such thing as "acknowledging an interrupt to the device". The thing an ISR should do is handle the interrupt condition. For example, if the UART requested an interrupt because it has incoming data, then you should read that incoming data. After that read operation, the UART no longer has incoming data, so it naturally stops asserting the IRQ line. Alternatively, if your program no longer needs to read the data and wants to stop the communication, it can just mask the receiver interrupt via the UART registers, and, once again, the UART will stop asserting the IRQ line. If the device just wanted to signal you some state change, then you should read the new state, and the device will know that you have the up-to-date state and will release the IRQ line. So, in short: there is usually no device-specific acknowledge procedure. All you need to do is service the interrupt condition, after which that condition will disappear, voiding the interrupt request.
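To make the UART example concrete: a receive handler "acknowledges" simply by draining the FIFO (a sketch with standard 16550 port numbers; port_read and the buffer handling are illustrative):

#define UART_RBR       0x3F8 /* receiver buffer register (COM1 base) */
#define UART_LSR       0x3FD /* line status register (COM1 base + 5) */
#define LSR_DATA_READY 0x01

/* Reading RBR while data is ready is what deasserts the UART's IRQ line;
   there is no explicit 'ack' register to poke */
while (port_read(UART_LSR) & LSR_DATA_READY)
    rx_buffer[head++] = port_read(UART_RBR);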
Nested Interrupt Handling in ARM
Below is the flow mentioned in the Cortex-A Programmer's Guide; I have a few questions on the text.

A reentrant interrupt handler must therefore take the following steps after an IRQ exception is raised and control is transferred to the interrupt handler in the way previously described.

• The interrupt handler saves the context of the interrupted program (that is, it pushes onto the alternative kernel mode stack any registers which will be corrupted by the handler, including the return address and SPSR_IRQ).

Q> What is the alternative kernel mode stack here?

• It determines which interrupt source needs to be processed and clears the source in the external hardware (preventing it from immediately triggering another interrupt).

• The interrupt handler changes the processor to the other kernel mode, leaving the CPSR I bit set (interrupts are still disabled).

Q> From IRQ to SVC mode with CPSR.I = 1. Right?

• The interrupt handler saves the exception return address on the stack (a stack for the new mode, located in kernel memory) and re-enables interrupts.

Q> Are there 2 stacks here?

• It calls the appropriate C handler for the original interrupt (interrupts are still disabled).

• Upon completion, the interrupt handler disables IRQ and pops the exception return address from the stack.

• It restores the context of the interrupted program directly from the alternative kernel mode stack. This includes restoring the PC, and the CPSR which switches back to the previous execution mode.

Q> How is the nesting done here? I am a bit confused...
1) Up to you, really. The requirement is that it is one that cannot be asynchronously invoked. So you can use the System mode stack, which is shared with User mode - with some interesting implications. Or you can use the Supervisor mode stack, as long as you always properly store all context before executing an SVC instruction.

2) Yes.

3) Yes, you store the context on a stack for whichever mode was picked in (1).

4) While executing in the alternative mode, you re-enable interrupts (as your text states). At this point, the processor will now react to new interrupts signalled to the core - generally ones of a higher priority as configured in your interrupt controller.
Can an interrupt handler be preempted by the same interrupt handler?
Does the CPU disable all interrupts on local CPU before calling the interrupt handler? Or does it only disable that particular interrupt line, which is being served?
x86 disables all local interrupts (except NMI, of course) before jumping to the interrupt vector. Linux normally masks the specific interrupt line and re-enables the rest of the interrupts (those that aren't masked), unless a specific flag is passed to the interrupt handler registration. Note that while this means your interrupt handler will not race with itself on the same CPU, it can and will race with itself running on other CPUs in an SMP/SMT system.
Normally (at least in x86), an interrupt disables interrupts. When an interrupt is received, the hardware does these things:

1. Save all registers in a predetermined place.
2. Set the instruction pointer (AKA program counter) to the interrupt handler's address.
3. Set the register that controls interrupts to a value that disables all (or most) interrupts. This prevents another interrupt from interrupting this one.

An exception is NMI (non-maskable interrupt), which can't be disabled.
Yes, that's fine. I'd like to also add what I think might be relevant. In much real-world driver/kernel code, "bottom-half" (BH) handlers are used pretty often: tasklets, softirqs. These BHs run in interrupt context and can run in parallel with their top-half (TH) handlers on SMP (especially softirqs). Of course, recently there has been a move (mainly code migrated from the PREEMPT_RT project) towards mainline that essentially gets rid of the BH mechanism: all interrupt handlers will run with all interrupts disabled. Not only that, handlers are (or can be) converted to kernel threads - these are the so-called "threaded" interrupt handlers. As of today, the choice is still left to the developer: you can use the 'traditional' TH/BH style or the threaded style. Ref and details: http://lwn.net/Articles/380931/ http://lwn.net/Articles/302043/
Quoting Intel's own, surprisingly well-written "Intel® 64 and IA-32 Architectures Software Developer's Manual", Volume 1, pages 6-10: "If an interrupt or exception handler is called through an interrupt gate, the processor clears the interrupt enable (IF) flag in the EFLAGS register to prevent subsequent interrupts from interfering with the execution of the handler. When a handler is called through a trap gate, the state of the IF flag is not changed." So just to be clear: yes, effectively the CPU "disables" all interrupts before calling the interrupt handler. Properly described, the processor simply clears a flag which makes it ignore all interrupt requests. Except probably non-maskable interrupts and/or its own software exceptions (please someone correct me on this, not verified).
We want the ISR to be atomic, and no one should be able to preempt it. Therefore, an ISR disables local interrupts (i.e. interrupts on the current processor), and once the ISR calls ret_from_intr() (i.e. the ISR has finished), interrupts are again enabled on the current processor. If an interrupt occurs in the meantime, it will be served by another processor (in an SMP system) and the ISR related to that interrupt will start running there. In an SMP system, we also need to include a proper synchronization mechanism (spinlock) in the ISR.
spin_lock_irqsave vs spin_lock_irq
On an SMP machine we must use spin_lock_irqsave and not spin_lock_irq from interrupt context. Why would we want to save the flags (which contain the IF)? Is there another interrupt routine that could interrupt us?
spin_lock_irqsave is basically used to save the interrupt state before taking the spin lock; this is because the spin lock disables interrupts when the lock is taken in interrupt context, and re-enables them when unlocking. The interrupt state is saved so that it can be reinstated correctly. Example:

Let's say interrupt x was disabled before the spin lock was acquired.
spin_lock_irq will disable the interrupt x and take the lock.
spin_unlock_irq will enable the interrupt x.

So in the third step above, after releasing the lock, we will have interrupt x enabled, even though it was disabled before the lock was acquired. So only when you are sure that interrupts are not already disabled should you use spin_lock_irq; otherwise you should always use spin_lock_irqsave.
If interrupts are already disabled before your code starts locking, then when you call spin_unlock_irq you will forcibly re-enable interrupts in a potentially unwanted manner. If instead you also save the current interrupt enable state in flags through spin_lock_irqsave, then attempt to re-enable interrupts with the same flags after releasing the lock, the function will just restore the previous state (thus not necessarily enabling interrupts).

Example with spin_lock_irqsave:

spinlock_t mLock = SPIN_LOCK_UNLOCKED;
unsigned long flags;

spin_lock_irqsave(&mLock, flags);       // Save the state of interrupt enable in flags and then disable interrupts
// Critical section
spin_unlock_irqrestore(&mLock, flags);  // Return to the previous state saved in flags

Example with spin_lock_irq (without irqsave):

spinlock_t mLock = SPIN_LOCK_UNLOCKED;

spin_lock_irq(&mLock);    // Does not know if interrupts are already disabled
// Critical section
spin_unlock_irq(&mLock);  // Could result in an unwanted interrupt re-enable...
The need for spin_lock_irqsave besides spin_lock_irq is quite similar to the reason local_irq_save(flags) is needed besides local_irq_disable. Here is a good explanation of this requirement taken from Linux Kernel Development, Second Edition, by Robert Love.

The local_irq_disable() routine is dangerous if interrupts were already disabled prior to its invocation. The corresponding call to local_irq_enable() unconditionally enables interrupts, despite the fact that they were off to begin with. Instead, a mechanism is needed to restore interrupts to a previous state. This is a common concern because a given code path in the kernel can be reached both with and without interrupts enabled, depending on the call chain. For example, imagine the previous code snippet is part of a larger function. Imagine that this function is called by two other functions, one which disables interrupts and one which does not. Because it is becoming harder as the kernel grows in size and complexity to know all the code paths leading up to a function, it is much safer to save the state of the interrupt system before disabling it. Then, when you are ready to reenable interrupts, you simply restore them to their original state:

unsigned long flags;

local_irq_save(flags);    /* interrupts are now disabled */
/* ... */
local_irq_restore(flags); /* interrupts are restored to their previous state */

Note that these methods are implemented at least in part as macros, so the flags parameter (which must be defined as an unsigned long) is seemingly passed by value. This parameter contains architecture-specific data containing the state of the interrupt systems. Because at least one supported architecture incorporates stack information into the value (ahem, SPARC), flags cannot be passed to another function (specifically, it must remain on the same stack frame). For this reason, the call to save and the call to restore interrupts must occur in the same function.

All the previous functions can be called from both interrupt and process context.
Reading "Why kernel code/thread executing in interrupt context cannot sleep?", which links to Robert Love's article, I read this: some interrupt handlers (known in Linux as fast interrupt handlers) run with all interrupts on the local processor disabled. This is done to ensure that the interrupt handler runs without interruption, as quickly as possible. More so, all interrupt handlers run with their current interrupt line disabled on all processors. This ensures that two interrupt handlers for the same interrupt line do not run concurrently. It also prevents device driver writers from having to handle recursive interrupts, which complicate programming.
Below is part of the code in Linux kernel 4.15.18, which shows that spin_lock_irq() calls __raw_spin_lock_irq(). However, it does not save any flags, as you can see below; it only disables interrupts.

static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
{
    local_irq_disable();
    preempt_disable();
    spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
    LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

The code below shows spin_lock_irqsave(), which saves the current state of the flags and then disables preemption.

static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
    unsigned long flags;

    local_irq_save(flags);
    preempt_disable();
    spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
    /*
     * On lockdep we dont want the hand-coded irq-enable of
     * do_raw_spin_lock_flags() code, because lockdep assumes
     * that interrupts are not re-enabled during lock-acquire:
     */
#ifdef CONFIG_LOCKDEP
    LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
#else
    do_raw_spin_lock_flags(lock, &flags);
#endif
    return flags;
}
This question starts from a false assertion: "On an SMP machine we must use spin_lock_irqsave and not spin_lock_irq from interrupt context." Neither of these should be used from interrupt context, on SMP or on UP. That said, spin_lock_irqsave() may be used from interrupt context, as it is more universal (it can be used in both interrupt and normal contexts), but you are supposed to use spin_lock() from interrupt context, and spin_lock_irq() or spin_lock_irqsave() from normal context; see the sketch below. The use of spin_lock_irq() is almost always the wrong thing to do in interrupt context, be it SMP or UP. It may work because most interrupt handlers run with IRQs locally enabled, but you shouldn't try that. UPDATE: as some people misread this answer, let me clarify that it only explains what interrupt-context locking is and is not for. There is no claim here that spin_lock() should only be used in interrupt context. It can be used in a process context too, for example if there is no need to lock in interrupt context.
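A short sketch of that convention (standard Linux spinlock API; the lock and function names are illustrative):

static DEFINE_SPINLOCK(my_lock); /* protects data shared with the ISR */

/* Process context: disable local IRQs around the lock, saving prior state */
void process_context_update(void)
{
    unsigned long flags;

    spin_lock_irqsave(&my_lock, flags);
    /* ... touch shared data ... */
    spin_unlock_irqrestore(&my_lock, flags);
}

/* Interrupt context: IRQs are already off on this CPU, a plain lock suffices */
static irqreturn_t my_isr(int irq, void *dev_id)
{
    spin_lock(&my_lock);
    /* ... touch shared data ... */
    spin_unlock(&my_lock);
    return IRQ_HANDLED;
}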