I know that for hardware interrupts, when KeLowerIrql is called by KeReleaseInterruptSpinLock, the HAL adjusts the interrupt mask in the LAPIC, which allows queued interrupts (pending in the IRR, probably) to be serviced automatically. But what about software interrupts? For instance, ntdll.dll sysenter calls to the SSDT NtXxx system services: how are they 'postponed' and triggered when the IRQL drops to passive level? The same goes for the DPC dispatcher software interrupt (if the DPC is for the current CPU and of high priority): how is that triggered when IRQL < DISPATCH_LEVEL? Do the functions invoked for software interrupts (the NtXxx routines in the SSDT) all loop on a condition, i.e.
while (irql != passive)
Exactly the same question for lazy IRQL:
Because accessing a PIC is a relatively slow operation, HALs that require accessing the I/O bus to change IRQLs, such as for PIC and 32-bit Advanced Configuration and Power Interface (ACPI) systems, implement a performance optimization, called lazy IRQL, that avoids PIC accesses. When the IRQL is raised, the HAL notes the new IRQL internally instead of changing the interrupt mask. If a lower-priority interrupt subsequently occurs, the HAL sets the interrupt mask to the settings appropriate for the first interrupt and does not quiesce the lower-priority interrupt until the IRQL is lowered (thus keeping the interrupt pending). Thus, if no lower-priority interrupts occur while the IRQL is raised, the HAL doesn’t need to modify the PIC.
How does it keep this interrupt pending? Does it just loop on a condition until the higher priority ISR lowers the IRQL and when the thread is scheduled in, the condition will eventually be met? Is it just that simple?
Edit: I must be missing something here, because let's say an ISR at device IRQL (DIRQL) requests a DPC using IoRequestDpc. If it is a high-priority DPC and the target is the current processor, then it schedules an interrupt at DPC/dispatch level to drain the processor's DPC queue. This all happens in the ISR, which runs at DIRQL, which means the software interrupt at DPC/dispatch IRQL would, I think, spin at KeAcquireInterruptSpinLock because the current IRQL is too high. But wouldn't it be spinning there forever? The routine that actually lowers the IRQL is called after the ISR returns, meaning execution would stay stuck in the ISR at DIRQL waiting on a software interrupt that requires IRQL < DPC/dispatch IRQL (2). Not only that, the dispatcher would not be able to dispatch the next thread, because the dispatch DPC runs at DPC/dispatch level, which is far lower. There is one solution I can think of:
1) The ISR returns the KDPC object to KiInterruptDispatch so that it knows what priority the DPC is, and KiInterruptDispatch schedules it itself after it has lowered the IRQL using KeReleaseInterruptSpinLock. But KSERVICE_ROUTINE only returns an unrelated boolean value, so this is ruled out.
Does anyone know how this situation is avoided?
Edit 2: Perhaps it spawns a new thread that blocks waiting for IRQL < DISPATCH_LEVEL, then returns from the ISR and drops the IRQL.
This is something that isn't really explained explicitly in any source, and interestingly enough the second comment asks the same question.
Firstly, DPC software interrupts aren't like regular SSDT system service calls: those are not postponed, they run at passive IRQL, and they can be interrupted at any time. DPC software interrupts do not use int or syscall or anything like that; they are postponed, and they run at dispatch level.
After studying the ReactOS kernel and the WRK, I now know exactly what happens.
A driver, when it receives IRP_MN_START_DEVICE from the PnP manager, initialises an interrupt object with IoConnectInterrupt using the data in the CM_RESOURCE_LIST it receives in the IRP. Of particular interest are the vector and affinity that the PnP manager assigned to the device (which is simple to do if the device exposes an MSI capability in its PCIe configuration space, as it doesn't have to worry about underlying IRQ routing). The driver passes the vector, a pointer to an ISR, context for the ISR, and the IRQL to IoConnectInterrupt, which calls KeInitializeInterrupt to initialise the interrupt object from those parameters, and then calls KeConnectInterrupt.

KeConnectInterrupt switches the affinity of the current thread to the target processor, locks the dispatcher database, and checks whether that IDT entry points to a BugCheck wrapper, KxUnexpectedInterrupt0[IdtIndex]. If it does, it raises the IRQL to 31 so that the following is an atomic operation, and uses the HAL API to enable the vector that the PnP manager mapped on the LAPIC and to assign it a TPR priority level corresponding to the IRQL. It then maps the vector to the handler address in the IDT entry for the vector. To do this it passes the address &Interrupt->DispatchCode[0] into the IDT mapping routine KeSetIdtHandlerAddress. This appears to be a template that is the same for all interrupt objects, which according to the WRK is KiInterruptTemplate. Sure enough, checking the ReactOS kernel, we see in KeInitializeInterrupt -- which is called by IoConnectInterrupt -- the code:
RtlCopyMemory(Interrupt->DispatchCode,
              KiInterruptDispatchTemplate,
              sizeof(Interrupt->DispatchCode));
KiInterruptDispatchTemplate appears to be blank for now, because ReactOS's amd64 port is in early development. On Windows, however, it is implemented, as KiInterruptTemplate.
It then lowers the IRQL back to the old IRQL. If the IDT entry did not point to a BugCheck ISR, it initialises a chained interrupt instead -- because there was already a handler address at the IDT entry. It uses CONTAINING_RECORD to recover the already-connected interrupt object from the address of its handler member (DispatchCode[0]) and connects the new interrupt object to the one already present, initialising the existing object's LIST_ENTRY as the head of the list and marking it as a chained interrupt by setting the DispatchAddress member to the address of KiChainedDispatch. It then drops the dispatcher database spinlock, switches the affinity back, and returns the interrupt object.
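To make that chaining step concrete, here is a minimal sketch of it as I've just described it -- paraphrased in WRK/ReactOS style, not the actual source:

/* Hedged sketch of the chaining path described above; paraphrased,
   not actual WRK/ReactOS source. */
PKINTERRUPT Connected = CONTAINING_RECORD(IdtHandler,   /* == &DispatchCode[0] */
                                          KINTERRUPT,
                                          DispatchCode);

if (Connected->DispatchAddress != (PVOID)KiChainedDispatch)
{
    /* First time the vector is shared: the already-connected object
       becomes the list head and is switched to the chained dispatcher */
    InitializeListHead(&Connected->InterruptListEntry);
    Connected->DispatchAddress = (PVOID)KiChainedDispatch;
}

/* Link the new interrupt object into the chain for this vector */
InsertTailList(&Connected->InterruptListEntry, &Interrupt->InterruptListEntry);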
The driver then sets up a DPC -- with the DeferredRoutine as a member -- for the Device Object using IoInitializeDpcRequest.
FORCEINLINE VOID
IoInitializeDpcRequest(_In_ PDEVICE_OBJECT DeviceObject,
                       _In_ PIO_DPC_ROUTINE DpcRoutine)
{
    KeInitializeDpc(&DeviceObject->Dpc,
                    (PKDEFERRED_ROUTINE)DpcRoutine,
                    DeviceObject);
}
KeInitializeDpc calls KiInitializeDpc, which is hard-coded to set the priority to medium, which means that KeInsertQueueDpc will place it at the tail of the DPC queue (only high-importance DPCs go to the head, as we'll see below). KeSetImportanceDpc and KeSetTargetProcessorDpc can be called after the fact to set the initialised DPC's priority and target processor respectively. It copies the DPC object into the member of the device object, and if a DPC is already present there, the new request is queued behind the existing one. A sketch of the driver-side setup follows.
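For context, this is roughly what that driver-side setup looks like; MyDpcRoutine is a hypothetical deferred routine of mine, the rest are the documented calls:

/* Hedged sketch of the setup described above; MyDpcRoutine is hypothetical. */
IoInitializeDpcRequest(DeviceObject, MyDpcRoutine);      /* embeds the KDPC, medium importance */

/* Optionally override the defaults afterwards: */
KeSetImportanceDpc(&DeviceObject->Dpc, HighImportance);  /* queue at the head */
KeSetTargetProcessorDpc(&DeviceObject->Dpc, 0);          /* target processor 0 */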
When the interrupt happens, the KiInterruptTemplate copy inside the interrupt object is the address in the IDT that gets called, and it in turn calls the real interrupt dispatcher, the DispatchAddress member: KiInterruptDispatch for a normal interrupt or KiChainedDispatch for a chained one. The template passes the interrupt object to KiInterruptDispatch. It can do this because, as we saw earlier, RtlCopyMemory copied KiInterruptTemplate into the interrupt object, so the template can use RIP-relative addressing to recover the address of the interrupt object it is embedded in (it could also attempt something with CONTAINING_RECORD); intsup.asm contains the following code to do it:

lea rbp, KiInterruptTemplate - InDispatchCode ; get interrupt object address
jmp qword ptr InDispatchAddress[rbp] ; finish in common code

KiInterruptDispatch will then acquire the interrupt's spinlock, probably using KeAcquireInterruptSpinLock. The ISR (with ServiceContext as its parameter) calls IoRequestDpc with the device object that was created for the device and the ISR, along with interrupt-specific context and an optional IRP (which I'm guessing it gets from DeviceObject->Irp if the routine is meant to handle an IRP). I expected IoRequestDpc to be a one-line wrapper of KeInsertQueueDpc passing the Dpc member of the device object, and that's exactly what it is: KeInsertQueueDpc(&DeviceObject->Dpc, Irp, Context);. Firstly, KeInsertQueueDpc raises the IRQL from the device IRQL of the ISR to 31, which prevents all preemption. The WRK contains the following on line 263 of dpcobj.c:
#if !defined(NT_UP)

    if (Dpc->Number >= MAXIMUM_PROCESSORS) {
        Number = Dpc->Number - MAXIMUM_PROCESSORS;
        TargetPrcb = KiProcessorBlock[Number];

    } else {
        Number = CurrentPrcb->Number;
        TargetPrcb = CurrentPrcb;
    }
This suggests that the Dpc->Number member must be set by KeSetTargetProcessorDpc as the target core number plus MAXIMUM_PROCESSORS. This seemed bizarre, so I went and looked at ReactOS's KeSetTargetProcessorDpc, and sure enough it does exactly that. KiProcessorBlock appears to be a kernel array for fast access to the KPRCB structures of each core.
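For reference, the ReactOS routine boils down to something like this (a paraphrase of what I saw, not a verbatim copy):

/* Paraphrased from ReactOS: encode the target CPU past MAXIMUM_PROCESSORS so
   KeInsertQueueDpc can distinguish an explicitly targeted DPC from an
   untargeted one (an untargeted DPC keeps Number < MAXIMUM_PROCESSORS). */
VOID
NTAPI
KeSetTargetProcessorDpc(IN PKDPC Dpc,
                        IN CCHAR Number)
{
    Dpc->Number = MAXIMUM_PROCESSORS + Number;
}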
It then gets the core's normal DPC queue with DpcData = KiSelectDpcData(TargetPrcb, Dpc), which returns &Prcb->DpcData[DPC_NORMAL] because the type of the DPC passed to it is normal, not threaded. It then acquires the spinlock for the queue; this appears to be an empty function body on ReactOS, and I think it's because of this:
/* On UP builds, spinlocks don't exist at IRQL >= DISPATCH */
And that makes sense, because ReactOS only supports one core, meaning there is no thread on another core that could access the DPC queue (another core might target a DPC at this core's queue); there is only one DPC queue. On a multicore system it would have to acquire the spinlock, so these look to be placeholders for when multicore functionality is implemented. If it failed to acquire the spinlock for the DPC queue, it would either spin-wait at IRQL 31, or drop to the IRQL of the interrupt itself and spin-wait, allowing other interrupts to reach the core but no other threads to run on it.
Note that Windows would use KeAcquireSpinLockAtDpcLevel to acquire this spinlock; ReactOS does not. KeAcquireSpinLockAtDpcLevel does not touch the IRQL. Actually, the WRK directly uses KiAcquireSpinLock, as can be seen on line 275 of dpcobj.c (KiAcquireSpinLock(&DpcData->DpcLock);), which only acquires the spinlock and does nothing to the IRQL.
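To make the UP/MP difference concrete, here is a minimal sketch of what such a raw spinlock acquire might look like; this is my own construction under the assumption that the MP build really spins, not the WRK source:

/* Hedged sketch: on a UP build, being at IRQL >= DISPATCH_LEVEL already
   excludes everything else that could touch the queue, so the "lock" can
   compile to nothing; on MP it must really spin. */
#if defined(NT_UP)

#define KiAcquireSpinLock(Lock) ((VOID)(Lock))    /* no-op at IRQL >= 2 */

#else

FORCEINLINE
VOID
KiAcquireSpinLock(PKSPIN_LOCK Lock)
{
    /* Try to take bit 0; on failure, spin reading until it looks free */
    while (InterlockedBitTestAndSet64((LONG64 volatile *)Lock, 0))
    {
        while (*(LONG64 volatile *)Lock != 0)
        {
            YieldProcessor();    /* emits a pause, easing the busy-wait */
        }
    }
}

#endif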
After acquiring the spinlock, it first ensures that the DPC object isn't already on a queue: the DpcData member must be NULL when it does a cmpxchg to initialise it with the DpcData returned from KiSelectDpcData(TargetPrcb, Dpc); if the DPC is already queued, it drops the spinlock and returns. Otherwise, it sets the DPC members to point to the interrupt-specific context that was passed, and inserts it into the queue either at the head (InsertHeadList(&DpcData->DpcListHead, &Dpc->DpcListEntry);) or at the tail (InsertTailList(&DpcData->DpcListHead, &Dpc->DpcListEntry);) based on its priority (if (Dpc->Importance == HighImportance)); see the sketch after this paragraph. It then makes sure that a DPC isn't already executing or requested: if (!(Prcb->DpcRoutineActive) && !(Prcb->DpcInterruptRequested)). It then checks whether KiSelectDpcData returned the second KDPC_DATA structure, i.e. the DPC is of the threaded type (if (DpcData == &TargetPrcb->DpcData[DPC_THREADED])); if it is, and if ((TargetPrcb->DpcThreadActive == FALSE) && (TargetPrcb->DpcThreadRequested == FALSE)), then it does a locked xchg to set TargetPrcb->DpcSetEventRequest to TRUE, sets TargetPrcb->DpcThreadRequested and TargetPrcb->QuantumEnd to TRUE, and sets RequestInterrupt to TRUE if the target PRCB is the current PRCB; otherwise it only sets it to TRUE if the target core is not idle.
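A minimal sketch of that guard-and-insert sequence for the non-threaded case, paraphrased in WRK style (not the verbatim source):

/* Hedged sketch of the guard and insertion described above. */

/* Claim the DPC for this queue; bail out if it is already queued somewhere */
if (InterlockedCompareExchangePointer((PVOID volatile *)&Dpc->DpcData,
                                      DpcData,
                                      NULL) != NULL)
{
    KiReleaseSpinLock(&DpcData->DpcLock);
    return FALSE;                        /* already on a queue */
}

/* Stash the interrupt-specific context in the DPC object */
Dpc->SystemArgument1 = SystemArgument1;
Dpc->SystemArgument2 = SystemArgument2;
DpcData->DpcQueueDepth += 1;
DpcData->DpcCount += 1;

/* High-importance DPCs jump the queue; everything else goes to the back */
if (Dpc->Importance == HighImportance)
{
    InsertHeadList(&DpcData->DpcListHead, &Dpc->DpcListEntry);
}
else
{
    InsertTailList(&DpcData->DpcListHead, &Dpc->DpcListEntry);
}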
Now comes the crux of the original question. The WRK now contains the following code:
#if !defined(NT_UP)

    if (CurrentPrcb != TargetPrcb) {
        if (((Dpc->Importance == HighImportance) ||
             (DpcData->DpcQueueDepth >= TargetPrcb->MaximumDpcQueueDepth))) {

            if (((KiIdleSummary & AFFINITY_MASK(Number)) == 0) ||
                (KeIsIdleHaltSet(TargetPrcb, Number) != FALSE)) {

                TargetPrcb->DpcInterruptRequested = TRUE;
                RequestInterrupt = TRUE;
            }
        }

    } else {
        if ((Dpc->Importance != LowImportance) ||
            (DpcData->DpcQueueDepth >= TargetPrcb->MaximumDpcQueueDepth) ||
            (TargetPrcb->DpcRequestRate < TargetPrcb->MinimumDpcRate)) {

            TargetPrcb->DpcInterruptRequested = TRUE;
            RequestInterrupt = TRUE;
        }
    }

#endif
In essence, on a multiprocessor system: if the target core taken from the DPC object is not the current core, then if the DPC is of high importance or the queue exceeds the maximum depth, and either the target core is not idle (the AND of the target's affinity mask and the idle-core summary is 0) or it is idle but halted (KeIsIdleHaltSet checks the Sleeping flag in the target PRCB), it sets the DpcInterruptRequested flag in the PRCB of the target core. If the target of the DPC is the current core, then if the DPC is not of low importance (note: this lets medium through!), or the DPC queue depth exceeds the maximum, or the DPC request rate on the core is below the minimum, it sets the same flag in the PRCB of the current core to indicate there is a DPC.
It now releases the DPC queue spinlock, KiReleaseSpinLock(&DpcData->DpcLock); (#if !defined(NT_UP), of course), which doesn't alter the IRQL. It then checks whether an interrupt was requested by the procedure (if (RequestInterrupt == TRUE)); if it is a uniprocessor system (#if defined(NT_UP)) it simply calls KiRequestSoftwareInterrupt(DISPATCH_LEVEL);, but on a multicore system it needs to check the target PRCB to see whether it must send an IPI:
    if (TargetPrcb != CurrentPrcb) {
        KiSendSoftwareInterrupt(AFFINITY_MASK(Number), DISPATCH_LEVEL);

    } else {
        KiRequestSoftwareInterrupt(DISPATCH_LEVEL);
    }
What that does speaks for itself: if the current PRCB is not the target PRCB of the DPC, it sends an IPI of DISPATCH_LEVEL priority to the target processor using KiSendSoftwareInterrupt; otherwise, it uses KiRequestSoftwareInterrupt. There is no documentation at all, but my guess is that this is a self-IPI, wrapping a HAL function that programs the ICR to send an IPI to the current core at dispatch-level priority (my reasoning being that ReactOS at this stage calls HalRequestSoftwareInterrupt, which shows an unimplemented PIC write). So it's not a software interrupt in the INT sense but is actually, put simply, a hardware interrupt. It then lowers the IRQL back from 31 to the previous IRQL (the ISR's IRQL), returns to the ISR, and the ISR returns to KiInterruptDispatch. KiInterruptDispatch then releases the ISR spinlock using KeReleaseInterruptSpinLock, which reduces the IRQL to what it was before the interrupt, and then pops the trap frame. I would have thought it would first pop the trap frame and then program the LAPIC TPR, so that the register-restore process is atomic, but I suppose it doesn't really matter.
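To illustrate that guess, here is what such a self-IPI could look like on a memory-mapped xAPIC. Every name here (HalpApicBase, DISPATCH_VECTOR, HalpRequestSelfIpi) is a hypothetical stand-in, since the real routine is undocumented:

/* Hedged sketch of a dispatch-level self-IPI on an xAPIC. All names are
   hypothetical; the real HAL routine is undocumented. */
#define APIC_ICR_LOW        0x300        /* interrupt command register, low dword */
#define APIC_SHORTHAND_SELF 0x00040000   /* destination shorthand = self */
#define DISPATCH_VECTOR     0x2F         /* assumed: priority class 2 = DISPATCH_LEVEL */

extern volatile UCHAR *HalpApicBase;     /* assumed mapping of the LAPIC MMIO page */

VOID
HalpRequestSelfIpi(VOID)
{
    /* Fixed delivery, edge-triggered, destination = self. The interrupt goes
       through normal LAPIC prioritization, so it sits pending in the IRR
       until the TPR drops below priority class 2, i.e. IRQL < DISPATCH_LEVEL. */
    *(volatile ULONG *)(HalpApicBase + APIC_ICR_LOW) =
        APIC_SHORTHAND_SELF | DISPATCH_VECTOR;
}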
ReactOS has the following (the WRK doesn't have KeReleaseSpinLock or the IRQL-lowering procedures available, so this is the best we have):
VOID
NTAPI
KeReleaseSpinLock(PKSPIN_LOCK SpinLock,
                  KIRQL OldIrql)
{
    /* Release the lock and lower IRQL back */
    KxReleaseSpinLock(SpinLock);
    KeLowerIrql(OldIrql);
}
VOID
FASTCALL
KfReleaseSpinLock(PKSPIN_LOCK SpinLock,
                  KIRQL OldIrql)
{
    /* Simply lower IRQL back */
    KeLowerIrql(OldIrql);
}
KeLowerIrql is a wrapper for the HAL function KfLowerIrql; its body contains KfLowerIrql(OldIrql); and that's it.
VOID
FASTCALL
KfLowerIrql(KIRQL NewIrql)
{
    DPRINT("KfLowerIrql(NewIrql %d)\n", NewIrql);

    if (NewIrql > KeGetPcr()->Irql)
    {
        DbgPrint("(%s:%d) NewIrql %x CurrentIrql %x\n",
                 __FILE__, __LINE__, NewIrql, KeGetPcr()->Irql);
        KeBugCheck(IRQL_NOT_LESS_OR_EQUAL);
        for (;;);
    }

    HalpLowerIrql(NewIrql);
}
This function basically prevents the new IRQL from being higher than the current IRQL, which makes sense because the function is supposed to lower the IRQL. If everything is OK, it calls HalpLowerIrql(NewIrql);. What follows is a skeleton of a multiprocessor AMD64 implementation -- it does not actually implement the APIC register writes (or the MSR writes for x2APIC); they are empty functions in ReactOS's multiprocessor AMD64 implementation, as it is in development. On Windows they won't be empty, and they'll actually program the LAPIC TPR so that the queued software interrupt can now occur.
HalpLowerIrql(KIRQL NewIrql, BOOLEAN FromHalEndSystemInterrupt)
{
    ULONG Flags;
    UCHAR DpcRequested;

    if (NewIrql >= DISPATCH_LEVEL)
    {
        KeSetCurrentIrql(NewIrql);
        APICWrite(APIC_TPR, IRQL2TPR(NewIrql) & APIC_TPR_PRI);
        return;
    }

    Flags = __readeflags();
    if (KeGetCurrentIrql() > APC_LEVEL)
    {
        KeSetCurrentIrql(DISPATCH_LEVEL);
        APICWrite(APIC_TPR, IRQL2TPR(DISPATCH_LEVEL) & APIC_TPR_PRI);

        DpcRequested = __readfsbyte(FIELD_OFFSET(KIPCR, HalReserved[HAL_DPC_REQUEST]));
        if (FromHalEndSystemInterrupt || DpcRequested)
        {
            __writefsbyte(FIELD_OFFSET(KIPCR, HalReserved[HAL_DPC_REQUEST]), 0);
            _enable();
            KiDispatchInterrupt();
            if (!(Flags & EFLAGS_INTERRUPT_MASK))
            {
                _disable();
            }
        }

        KeSetCurrentIrql(APC_LEVEL);
    }

    if (NewIrql == APC_LEVEL)
    {
        return;
    }

    if (KeGetCurrentThread() != NULL &&
        KeGetCurrentThread()->ApcState.KernelApcPending)
    {
        _enable();
        KiDeliverApc(KernelMode, NULL, NULL);
        if (!(Flags & EFLAGS_INTERRUPT_MASK))
        {
            _disable();
        }
    }

    KeSetCurrentIrql(PASSIVE_LEVEL);
}
Firstly, it checks whether the new IRQL is at or above dispatch level; if so, it simply records it, writes it to the LAPIC TPR register, and returns. If not, it checks whether the current IRQL is above APC_LEVEL (i.e. at dispatch level or higher), which by definition means the new IRQL is going to be less than dispatch level. We can see that in this event it sets the IRQL to DISPATCH_LEVEL rather than letting it drop below, and writes that to the LAPIC TPR. It then checks HalReserved[HAL_DPC_REQUEST], which appears to be what ReactOS uses instead of the DpcInterruptRequested flag we saw previously, so just substitute one for the other. It then sets it to 0 (note that the PCR begins at the start of the segment pointed to by FS in kernel mode). It then enables interrupts and calls KiDispatchInterrupt, and afterwards, if the IF flag was clear before this whole procedure, it disables interrupts again. It then also checks whether a kernel APC is pending (which is beyond the scope of this explanation) before finally setting the IRQL to passive level.
VOID
NTAPI
KiDispatchInterrupt(VOID)
{
    PKIPCR Pcr = (PKIPCR)KeGetPcr();
    PKPRCB Prcb = &Pcr->Prcb;
    PKTHREAD NewThread, OldThread;

    /* Disable interrupts */
    _disable();

    /* Check for pending timers, pending DPCs, or pending ready threads */
    if ((Prcb->DpcData[0].DpcQueueDepth) ||
        (Prcb->TimerRequest) ||
        (Prcb->DeferredReadyListHead.Next))
    {
        /* Retire DPCs while under the DPC stack */
        //KiRetireDpcListInDpcStack(Prcb, Prcb->DpcStack);
        // FIXME!!! //
        KiRetireDpcList(Prcb);
    }

    /* Re-enable interrupts */
    _enable();

    /* Check for quantum end */
    if (Prcb->QuantumEnd)
    {
        /* Handle quantum end */
        Prcb->QuantumEnd = FALSE;
        KiQuantumEnd();
    }
    else if (Prcb->NextThread)
    {
        /* Capture current thread data */
        OldThread = Prcb->CurrentThread;
        NewThread = Prcb->NextThread;

        /* Set new thread data */
        Prcb->NextThread = NULL;
        Prcb->CurrentThread = NewThread;

        /* The thread is now running */
        NewThread->State = Running;
        OldThread->WaitReason = WrDispatchInt;

        /* Make the old thread ready */
        KxQueueReadyThread(OldThread, Prcb);

        /* Swap to the new thread */
        KiSwapContext(APC_LEVEL, OldThread);
    }
}
Firstly, it disables interrupts. _disable is just a wrapper around an asm block that clears the IF flag, with memory and cc in the clobber list (to prevent compiler reordering). This particular build is ARM syntax, though:
{
    __asm__ __volatile__
    (
        "cpsid i # __cli" : : : "memory", "cc"
    );
}
This ensures that it can drain the DPC queue as an uninterrupted procedure: with interrupts disabled, it cannot be interrupted by a clock interrupt and rescheduled. This prevents the scenario of two schedulers running at the same time; for instance, if a thread yields with Sleep(), it ends up calling KeRaiseIrqlToSynchLevel, which is analogous to disabling interrupts. That prevents a timer interrupt from coming in and scheduling another thread switch on top of the currently executing thread-switch procedure -- it ensures that scheduling is atomic.
It checks whether there are DPCs on the normal queue of the current core, or a timer expiry, or deferred ready threads, and then calls KiRetireDpcList, which basically contains a "while queue depth != 0" loop. The loop first checks whether there is a timer expiry request (which I won't go into now); if not, it acquires the DPC queue spinlock, takes a DPC off the queue and parses its members into arguments (interrupts still disabled), decreases the queue depth, drops the spinlock, enables interrupts, and calls the DeferredRoutine. When the DeferredRoutine returns, it disables interrupts again, and if there are more DPCs in the queue it reacquires the spinlock. (The spinlock plus disabled interrupts ensure that removal from the queue is atomic, so that another interrupt, and hence another DPC queue drain, cannot work on the same DPC -- it will already have been removed from the queue.) Since the DPC queue spinlock is not implemented yet on ReactOS, we can postulate what might happen on Windows: if it fails to acquire the spinlock, then given that it's a spinlock, that we are still at DISPATCH_LEVEL, and that interrupts are disabled, it would spin until the thread on the other core calls KeReleaseSpinLockFromDpcLevel(&DpcData->DpcLock);. That is not much of a holdup, as each thread holds the spinlock for about 100 uops I'd say, so we can afford to have interrupts disabled at DISPATCH_LEVEL. A sketch of the loop follows.
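Here is a minimal sketch of that loop as just described, in WRK style (paraphrased, not the actual source; the timer-expiry branch is omitted):

/* Hedged sketch of the KiRetireDpcList drain loop described above;
   names follow the WRK/ReactOS style but the body is paraphrased. */
VOID
FASTCALL
KiRetireDpcList(PKPRCB Prcb)
{
    PKDPC_DATA DpcData = &Prcb->DpcData[DPC_NORMAL];
    PLIST_ENTRY Entry;
    PKDPC Dpc;
    PKDEFERRED_ROUTINE DeferredRoutine;
    PVOID DeferredContext, SystemArgument1, SystemArgument2;

    /* Interrupts are disabled on entry */
    while (DpcData->DpcQueueDepth != 0)
    {
        /* (Timer expiry handling omitted) */
        KiAcquireSpinLock(&DpcData->DpcLock);        /* atomic removal */
        Entry = RemoveHeadList(&DpcData->DpcListHead);
        Dpc = CONTAINING_RECORD(Entry, KDPC, DpcListEntry);

        /* Capture the arguments and mark the DPC as off the queue */
        DeferredRoutine = Dpc->DeferredRoutine;
        DeferredContext = Dpc->DeferredContext;
        SystemArgument1 = Dpc->SystemArgument1;
        SystemArgument2 = Dpc->SystemArgument2;
        Dpc->DpcData = NULL;
        DpcData->DpcQueueDepth -= 1;
        KiReleaseSpinLock(&DpcData->DpcLock);

        /* Run the deferred routine with interrupts enabled */
        _enable();
        DeferredRoutine(Dpc, DeferredContext, SystemArgument1, SystemArgument2);
        _disable();
    }
}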
Note that the drain procedure only ever drains the queue of the current core. When the DPC queue is empty, it re-enables interrupts and checks whether there are any deferred ready threads, making them all ready. It then returns down the call chain to KiInterruptTemplate, and the ISR officially ends.
So, as an overview: in KeInsertQueueDpc, if the DPC being queued targets another core and is of high priority, or the queue depth exceeds the maximum defined in the PRCB, it sets the DpcInterruptRequested flag in the PRCB of that core and sends an IPI to it, which most likely runs KiDispatchInterrupt in some way (the ISR could just be the IRQL-lowering procedure, which indeed calls KiDispatchInterrupt) and drains the DPC queue on that core. The actual wrapper that calls KiDispatchInterrupt may or may not clear the DpcInterruptRequested flag in the PRCB like HalpLowerIrql does, but I don't know; it may indeed be HalpLowerIrql, as I suggested. After KeInsertQueueDpc, when it lowers the IRQL, nothing happens, because the DpcInterruptRequested flag is set on the other core, not the current core. If the DPC being queued targets the current core, then if it is of high or medium importance, or the queue depth has exceeded the maximum, or the DPC request rate is below the minimum defined in the PRCB, it sets the DpcInterruptRequested flag in the current PRCB and requests a self-IPI, which will invoke the same generic wrapper used by the scheduler as well, so probably something like HalpLowerIrql. After KeInsertQueueDpc, it lowers the IRQL with HalpLowerIrql, sees DpcInterruptRequested set, and drains the queue of the current core before lowering the IRQL.
Do you see the problem with this, though? The WRK shows a 'software' interrupt being requested (whose ISR probably calls KiDispatchInterrupt, as it is a multi-purpose function and only one function -- KiRequestSoftwareInterrupt(DISPATCH_LEVEL) -- is ever used, in all scenarios), but then ReactOS shows KiDispatchInterrupt also being called when the IRQL drops. You'd expect that when KiInterruptDispatch drops the ISR spinlock, the routine doing so would just check for deferred ready threads or a timer expiry request and then drop the IRQL, because the software interrupt to drain the queue will fire as soon as the LAPIC TPR is programmed; but ReactOS actually checks for items on the queue (using the flag in the PRCB) and initiates the queue drain inside the procedure that lowers the IRQL. There is no WRK source for the spinlock release, so let's assume it doesn't do what ReactOS does and instead lets the 'software' interrupt handle it -- perhaps it leaves that whole DPC-queue check out of its equivalent of HalpLowerIrql. But wait a second: what is Prcb->DpcInterruptRequested for, then, if it isn't used to initiate the queue drain as on ReactOS? Perhaps it is merely a control variable so that two software interrupts aren't queued. But we also note that ReactOS requests a 'software' interrupt at this stage as well (to ARM's Vectored Interrupt Controller), which is extremely odd -- so maybe not. This blatantly suggests that it gets called twice: it drains the queue, and then the 'software' interrupt comes in immediately afterwards when the IRQL drops (and most likely also calls KiRetireDpcList at some stage), both on ReactOS and in the WRK, and does the same thing. I wonder what anyone makes of that. I mean, why both self-IPI and then drain the queue anyway? One of these actions is redundant.
As for lazy IRQL: I see no evidence of it in the WRK or ReactOS, but the place it would be implemented is KiInterruptDispatch. It would be possible to get the current IRQL using KeGetCurrentIrql, compare it to the IRQL of the interrupt object, and then program the TPR to correspond to the current IRQL. It would then either quiesce the interrupt and queue another for that vector using a self-IPI, or simply switch trap frames.
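Purely as an illustration of the mechanism in the Windows Internals passage quoted above, a lazy-IRQL check might look something like this; this is entirely my assumption, and IRQL_FROM_VECTOR and HalpSetHardwareMask are hypothetical helpers:

/* Hedged sketch of lazy IRQL, based only on the quoted passage.
   IRQL_FROM_VECTOR and HalpSetHardwareMask are hypothetical names. */
BOOLEAN
HalBeginSystemInterrupt(KIRQL Irql, ULONG Vector, PKIRQL OldIrql)
{
    KIRQL Current = KeGetCurrentIrql();   /* raised lazily; mask never written */

    if (IRQL_FROM_VECTOR(Vector) <= Current)
    {
        /* A lower-priority interrupt slipped through because the PIC mask was
           never updated: write the real mask now and dismiss the interrupt
           without running its ISR. The PIC keeps it pending, and it is
           delivered once the IRQL is lowered and the mask is reopened. */
        HalpSetHardwareMask(Current);
        return FALSE;                     /* dismissed, left pending */
    }

    *OldIrql = Current;
    KeSetCurrentIrql(Irql);               /* again, only a software note */
    return TRUE;                          /* safe to dispatch the ISR */
}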
In the book Linux Kernel Development (Robert Love), it is mentioned that:
we must disable local interrupts before obtaining a spinlock in an interrupt handler. Otherwise it is possible for an interrupt handler to interrupt kernel code while the lock is held and attempt to re-acquire the lock, which can finally lead to a double-acquire deadlock.
Now my doubt is:
In general, doesn't do_IRQ() disable local interrupts?
And if the lock is acquired, it means the preempt_count variable is not zero, so no other handler should get a chance, as the kernel is not preemptible. So how can another interrupt handler run in this situation?
First, the do_IRQ() function doesn't disable local interrupts; some code written in assembly language does, at the interrupt entry point. Later, before executing the handler registered by request_irq(), handle_IRQ_event() compares a flag (also passed to request_irq()) with IRQF_DISABLED to determine whether local interrupts should be enabled while the handler runs. So the answer to your first question depends on the flags you pass to request_irq().
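For reference, in older kernels that check looked roughly like this (paraphrased from pre-2.6.35 sources, where IRQF_DISABLED still existed; treat the details as approximate):

/* Rough paraphrase of the old handle_IRQ_event() flag check; IRQF_DISABLED
   was removed from later kernels, so treat this as historical/approximate. */
irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
{
    irqreturn_t ret = IRQ_NONE;

    /* Unless the handler asked to run with IRQs off, re-enable them */
    if (!(action->flags & IRQF_DISABLED))
        local_irq_enable();

    /* Walk the chain of handlers sharing this IRQ line */
    do {
        ret |= action->handler(irq, action->dev_id);
        action = action->next;
    } while (action);

    local_irq_disable();   /* leave with interrupts off again */
    return ret;
}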
Second, preempt_count only governs kernel preemption in process context, not interrupts. To keep interrupt handlers from executing on a UP system, the only way is to disable local IRQs (local_irq_disable()). When preempt_count is zero, the kernel can safely perform a process switch; otherwise it cannot.
First of all, sorry for a bit of ambiguity in the question... What I want to understand is the scenario below.
Suppose a process is running and it holds a lock. Now, after it acquires the lock, a HW interrupt is generated. How will the kernel handle this situation? Will it wait for the lock? If yes, what if the interrupt handler needs to access that lock, or the shared data protected by that lock in the process?
The Linux kernel has a few functions for acquiring spinlocks, to deal with issues like the one you're raising here. In particular, there is spin_lock_irq(), which disables interrupts (on the CPU the process is running on) and acquires the spinlock. This can be used when the code knows interrupts are enabled before the spinlock is acquired; in case the function might be called in different contexts, there is also spin_lock_irqsave(), which stashes away the current state of interrupts before disabling them, so that they can be reenabled by spin_unlock_irqrestore().
In any case, if a lock is used in both process and interrupt context (which is a good and very common design if there is data that needs to be shared between the contexts), then process context must disable interrupts (locally on the CPU it's running on) when acquiring the spinlock to avoid deadlocks. In fact, lockdep ("CONFIG_PROVE_LOCKING") will verify this and warn if a spinlock is used in a way that is susceptible to the "interrupt while process context holds a lock" deadlock.
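A minimal sketch of that pattern, assuming hypothetical names (my_lock, shared_data, my_isr):

/* Hedged sketch of the pattern described above: process context disables
   local interrupts while holding a lock that the ISR also takes. */
#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(my_lock);
static int shared_data;

/* Process context: must block local interrupts while holding the lock */
static void update_from_process_context(int value)
{
    unsigned long flags;

    spin_lock_irqsave(&my_lock, flags);       /* saves and disables IRQs */
    shared_data = value;
    spin_unlock_irqrestore(&my_lock, flags);  /* restores the saved IRQ state */
}

/* Interrupt context: IRQs are already off on this CPU, so the plain
   spin_lock() variant is enough here */
static irqreturn_t my_isr(int irq, void *dev_id)
{
    spin_lock(&my_lock);
    shared_data++;
    spin_unlock(&my_lock);
    return IRQ_HANDLED;
}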
Let me explain some basic properties of an interrupt handler or bottom half.
A handler can’t transfer data to or from user space, because it doesn’t execute in the context of a process.
Handlers also cannot do anything that would sleep, such as calling wait_event, allocating memory with anything other than GFP_ATOMIC, or locking a semaphore.
Handlers cannot call schedule().
What I am trying to say is that interrupt handlers run in atomic context. They cannot sleep, as they cannot be rescheduled; interrupts do not have a backing process context.
The above is by design. You can do whatever you want in code; just be prepared for the consequences.
Let us assume that you acquire a lock in an interrupt handler (bad design).
When an interrupt occurs, the processor saves its registers on the stack and starts the ISR. Now, if the ISR tries to acquire a lock the interrupted process already holds, you are in a deadlock, as there is no way for the ISR to know what the process was doing.
The process will not be able to resume execution until the ISR is done.
In a preemptive kernel the ISR and the process can be preempted, but in a non-preemptive kernel you are dead.
I read this article, http://www.linuxjournal.com/article/5833, to learn about spinlocks, and I am trying to use them in my kernel driver.
Here is what my driver code needs to do:
In f1(), it grabs the spinlock, and a caller calling f2() will then wait for the lock, since the spinlock is never unlocked in f1(). The spinlock will be unlocked in my interrupt handler (triggered by the HW).
void f1() {
    spin_lock(&mylock);
    // write hardware
    REG_ADDR += FLAG_A;
}

void f2() {
    spin_lock(&mylock);
    //...
}
The hardware will send the application an interrupt and my interrupt handler will call spin_unlock(&mylock);
My question is, if I call
f1();
f2(); // I want this to block until the interrupt returns, saying setting REG_ADDR is done.
When I run this, I get a kernel warning about a deadlock: "INFO: possible recursive locking detected".
How can I re-write my code so that kernel does not think I have a deadlock?
I want my driver code to wait until HW sends me an interrupt saying setting REG_ADDR is done.
Thank you.
First, since you'll be expecting to block while waiting for the interrupt, you shouldn't be using spinlocks to lock the hardware as you'll probably be holding the lock for a long time. Using a spinlock in this case will waste a lot of CPU cycles if that function is called frequently.
I would first use a mutex to lock access to the hardware register in question so other kernel threads can't simultaneously modify the register. A mutex is allowed to sleep so if it can't acquire the lock, the thread is able to go to sleep until it can.
Then, I'd use a wait queue to block the thread until the interrupt arrives and signals that the bit has finished setting.
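A minimal sketch of that mutex-plus-wait-queue pattern, assuming hypothetical names (reg_mutex, reg_waitq, reg_done, my_isr; FLAG_A is from your code):

/* Hedged sketch of the suggested approach: a mutex serializes access to the
   register, and a wait queue blocks until the ISR signals completion. */
#include <linux/mutex.h>
#include <linux/wait.h>
#include <linux/interrupt.h>
#include <linux/io.h>

static DEFINE_MUTEX(reg_mutex);
static DECLARE_WAIT_QUEUE_HEAD(reg_waitq);
static bool reg_done;

static int set_flag_and_wait(void __iomem *reg_addr)
{
    int ret;

    /* Serialize access to the register; a mutex is allowed to sleep */
    mutex_lock(&reg_mutex);
    reg_done = false;

    writel(readl(reg_addr) | FLAG_A, reg_addr);

    /* Sleep until the ISR signals completion */
    ret = wait_event_interruptible(reg_waitq, reg_done);

    mutex_unlock(&reg_mutex);
    return ret;
}

static irqreturn_t my_isr(int irq, void *dev_id)
{
    reg_done = true;
    wake_up_interruptible(&reg_waitq);   /* unblock the waiting thread */
    return IRQ_HANDLED;
}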
Also, as an aside, I noticed you're trying to access your peripheral with the expression REG_ADDR += FLAG_A;. In the kernel, that's not the correct way to do it. It may seem to work, but it will break on some architectures. You should be using the read{b,w,l} and write{b,w,l} macros, like:
unsigned long reg;
reg = readl(REG_ADDR);
reg |= FLAG_A;
writel(reg, REG_ADDR);
where REG_ADDR is an address you obtained from ioremap.
I agree with Michael that spinlocks, semaphores, mutexes (or any other locking mechanism) must be used when any resource (memory, a variable, a piece of code) may be shared among kernel/user threads.
Instead of using any of the available locking primitives, I would suggest using the sleeping facilities available in the kernel, such as wait_event_interruptible and wake_up. They are simple and easy to use in your code, and you can find the details on the net.
I am reading the following article by Robert Love,
http://www.linuxjournal.com/article/6916
which says:
"...Let's discuss the fact that work queues run in process context. This is in contrast to the other bottom-half mechanisms, which all run in interrupt context. Code running in interrupt context is unable to sleep, or block, because interrupt context does not have a backing process with which to reschedule. Therefore, because interrupt handlers are not associated with a process, there is nothing for the scheduler to put to sleep and, more importantly, nothing for the scheduler to wake up..."
I don't get it. AFAIK, the scheduler in the kernel is O(1), implemented through a bitmap. So what stops the scheduler from putting the interrupt context to sleep, taking the next schedulable process, and passing control to it?
So what stops the scheduler from putting the interrupt context to sleep and taking the next schedulable process and passing it the control?
The problem is that the interrupt context is not a process, and therefore cannot be put to sleep.
When an interrupt occurs, the processor saves the registers onto the stack and jumps to the start of the interrupt service routine. This means that when the interrupt handler is running, it is running in the context of the process that was executing when the interrupt occurred. The interrupt is executing on that process's stack, and when the interrupt handler completes, that process will resume executing.
If you tried to sleep or block inside an interrupt handler, you would wind up not only stopping the interrupt handler, but also the process it interrupted. This could be dangerous, as the interrupt handler has no way of knowing what the interrupted process was doing, or even if it is safe for that process to be suspended.
A simple scenario where things could go wrong would be a deadlock between the interrupt handler and the process it interrupts.
Process1 enters kernel mode.
Process1 acquires LockA.
Interrupt occurs.
ISR starts executing using Process1's stack.
ISR tries to acquire LockA.
ISR calls sleep to wait for LockA to be released.
At this point, you have a deadlock. Process1 can't resume execution until the ISR is done with its stack. But the ISR is blocked waiting for Process1 to release LockA.
I think it's a design decision.
Sure, you could design a system in which you can sleep in an interrupt, but besides making the system hard to comprehend and complicated (there are many, many situations you'd have to take into account), it doesn't really help anything. So from a design point of view, declaring that interrupt handlers cannot sleep is very clear and easy to implement.
From Robert Love (a kernel hacker):
http://permalink.gmane.org/gmane.linux.kernel.kernelnewbies/1791
You cannot sleep in an interrupt handler because interrupts do not have
a backing process context, and thus there is nothing to reschedule back
into. In other words, interrupt handlers are not associated with a task,
so there is nothing to "put to sleep" and (more importantly) "nothing to
wake up". They must run atomically.
This is not unlike other operating systems. In most operating systems,
interrupts are not threaded. Bottom halves often are, however.
The reason the page fault handler can sleep is that it is invoked only
by code that is running in process context. Because the kernel's own
memory is not pagable, only user-space memory accesses can result in a
page fault. Thus, only a few certain places (such as calls to
copy_{to,from}_user()) can cause a page fault within the kernel. Those
places must all be made by code that can sleep (i.e., process context,
no locks, et cetera).
Because the thread-switching infrastructure is unusable at that point. When servicing an interrupt, only stuff of higher priority can execute -- see the Intel Software Developer's Manual on interrupt, task, and processor priority. If you did allow another thread to execute (which you imply would be easy to do), you wouldn't be able to let it do anything: if it caused a page fault, you'd have to use services in the kernel that are unusable while the interrupt is being serviced (see below for why).
Typically, your only goal in an interrupt routine is to get the device to stop interrupting and to queue something at a lower interrupt level (in Unix this is typically a non-interrupt level, but in Windows it's dispatch, APC, or passive level) to do the heavy lifting, where you have access to more features of the kernel/OS. See: implementing a handler.
It's a property of how OSes have to work, not something inherent in Linux. An interrupt routine can execute at any point, so the state of whatever you interrupted is inconsistent. If you interrupted the thread-scheduling code, its state is inconsistent, so you can't be sure you can "sleep" and switch threads. Even if you protect the thread-switching code from being interrupted, thread switching is a very high-level feature of the OS, and if you protected everything it relies on, an interrupt would become more of a suggestion than the imperative implied by its name.
So what stops the scheduler from putting the interrupt context to sleep and taking the next schedulable process and passing it the control?
Scheduling happens on timer interrupts. The basic rule is that only one interrupt can be open at a time, so if you go to sleep in the "got data from device X" interrupt, the timer interrupt cannot run to schedule it out.
Interrupts also happen many times and overlap. If you put the "got data" interrupt to sleep, and then get more data, what happens? It's confusing (and fragile) enough that the catch-all rule is: no sleeping in interrupts. You will do it wrong.
Disallowing an interrupt handler from blocking is a design choice. When some data arrives at the device, the interrupt handler intercepts the current process, prepares the transfer of the data, and re-enables the interrupt; until the handler re-enables the current interrupt, the device has to hang. We want to keep our I/O busy and our system responsive, so we had better not block in the interrupt handler.
I don't think the "unstable states" are the essential reason. Processes, whether in user mode or kernel mode, should be aware that they may be interrupted. If some kernel-mode data structure is accessed by both the interrupt handler and the current process, and a race condition exists, then the current process should disable local interrupts; moreover, on multi-processor architectures, spinlocks should be used during the critical sections.
Nor do I think that if the interrupt handler were blocked, it could not be woken up. When we say "block", it basically means that the blocked process is waiting for some event or resource, so it links itself into some wait queue for that event/resource. Whenever the resource is released, the releasing process is responsible for waking up the waiting process(es).
However, the really annoying thing is that the blocked process can do nothing during the blocking time; it did nothing wrong to deserve this punishment, which is unfair. And nobody can predict the blocking time, so the innocent process has to wait for an unclear reason and for an unbounded time.
Even if you could put an ISR to sleep, you wouldn't want to do it. You want your ISRs to be as fast as possible to reduce the risk of missing subsequent interrupts.
The Linux kernel has two ways to allocate the interrupt stack: one is on the kernel stack of the interrupted process; the other is a dedicated per-CPU interrupt stack. If the interrupt context is saved on the dedicated per-CPU interrupt stack, then the interrupt context is indeed completely unassociated with any process. The "current" macro will produce an invalid pointer to the current running process, since on some architectures the "current" macro is computed from the stack pointer, and the stack pointer in interrupt context may point into the dedicated interrupt stack rather than the kernel stack of some process.
By nature, the question is whether you can get a valid "current" (the address of the current process's task_struct) inside an interrupt handler. If yes, it would be possible to modify its contents to put it into the "sleep" state, and the scheduler could bring it back later once the state changed somehow. The answer may be hardware-dependent.
But on ARM it's impossible, since 'current' is unrelated to the process while in interrupt mode. See the code below:
/* linux/arch/arm/include/asm/thread_info.h */
static inline struct thread_info *current_thread_info(void)
{
    register unsigned long sp asm ("sp");
    return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
sp in USER mode and SVC mode are the "same" ("same" here doesn't mean they're equal: user mode's sp points to the user-space stack, while SVC mode's sp, r13_svc, points to the kernel stack, at whose base the process's thread_info was set up at the previous task switch. When a system call occurs, the process enters kernel space, where sp (sp_svc) is that process's kernel stack pointer; the two stack pointers are associated with each other, and in that sense they're the "same"). So under SVC mode, kernel code can get a valid 'current'. But under other privileged modes, say interrupt mode, sp is "different": it points to a dedicated address defined in cpu_init(). The 'current' computed under these modes is unrelated to the interrupted process, and accessing it will result in unexpected behaviour. That's why it is always said that a system call can sleep but an interrupt handler can't: a system call works in process context, but an interrupt does not.
High-level interrupt handlers mask the operations of all lower-priority interrupts, including those of the system timer interrupt. Consequently, the interrupt handler must avoid involving itself in an activity that might cause it to sleep. If the handler sleeps, then the system may hang because the timer is masked and incapable of scheduling the sleeping thread.
Does this make sense?
If a higher-level interrupt routine gets to the point where the next thing it must do has to happen after a period of time, then it needs to put a request into the timer queue, asking that another interrupt routine be run (at lower priority level) some time later.
When that interrupt routine runs, it would then raise priority level back to the level of the original interrupt routine, and continue execution. This has the same effect as a sleep.
It is just a design/implementation choice in the Linux OS. The advantage of this design is simplicity, but it may not be good for hard real-time requirements.
Other OSes have other designs/implementations.
For example, in Solaris, interrupts can have different priorities, which allows most device interrupts to be handled in interrupt threads. Interrupt threads are allowed to sleep because each interrupt thread has a separate stack in the context of the thread.
The interrupt-thread design is good for real-time threads, which should be able to have higher priorities than interrupts.