Linux Interrupt Handling - linux-kernel

I am trying to understand the Linux interrupt handling mechanism. I tried googling a bit but couldn't find an answer to this one. Can someone please explain why handle_IRQ_event needs to call local_irq_disable at the end? After this, control goes back to do_IRQ, which eventually returns to the entry point. Then who enables interrupts again? Is it the responsibility of the interrupt handler? If so, why?
Edit
Code for reference
asmlinkage int handle_IRQ_event(unsigned int irq, struct pt_regs *regs, struct irqaction *action)
{
    int status = 1;
    int retval = 0;

    if (!(action->flags & SA_INTERRUPT))
        local_irq_enable();
    do {
        status |= action->flags;
        retval |= action->handler(irq, action->dev_id, regs);
        action = action->next;
    } while (action);
    if (status & SA_SAMPLE_RANDOM)
        add_interrupt_randomness(irq);
    local_irq_disable();
    return retval;
}

The version of handle_IRQ_event from LDD3 appears to come from the 2.6.8 kernel, or possibly earlier. Assuming we're dealing with x86, the processor clears the interrupt flag (IF) in the EFLAGS register before it calls the interrupt handler. The old EFLAGS register will be restored by the iret instruction.
Linux's SA_INTERRUPT IRQ handler flag (now obsolete) determined whether higher-priority interrupts were allowed while the handler ran. The SA_INTERRUPT flag was set for "fast" interrupt handlers, which left interrupts disabled, and was clear for "slow" interrupt handlers, which re-enabled interrupts.
Regardless of the SA_INTERRUPT flag, do_IRQ itself runs with interrupts disabled and they are still disabled when handle_IRQ_event is called. Since handle_IRQ_event can enable interrupts, the call to local_irq_disable at the end ensures they are disabled again on return to do_IRQ.
The relevant source code files in the 2.6.8 kernel for i386 architecture are arch/i386/kernel/entry.S, and arch/i386/kernel/irq.c.
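As a user-space illustration (not kernel code), the do/while walk over the shared irqaction chain in handle_IRQ_event can be modelled as below. The struct and the sample handlers are simplified stand-ins for the real 2.6-era types, and the 1/0 return values stand in for IRQ_HANDLED/IRQ_NONE:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the 2.6-era irqaction chain:
 * handlers sharing one IRQ line form a singly linked list, and
 * handle_IRQ_event ORs their return values together. */
struct irqaction {
    int (*handler)(int irq, void *dev_id);
    void *dev_id;
    struct irqaction *next;
};

/* Walk the chain exactly as the do/while loop in handle_IRQ_event does. */
int run_irq_chain(int irq, struct irqaction *action)
{
    int retval = 0;
    do {
        retval |= action->handler(irq, action->dev_id);
        action = action->next;
    } while (action);
    return retval;
}

/* Sample handlers: 1 stands in for IRQ_HANDLED, 0 for IRQ_NONE. */
static int handled(int irq, void *dev)  { (void)irq; (void)dev; return 1; }
static int not_mine(int irq, void *dev) { (void)irq; (void)dev; return 0; }
```

The OR of the return values is how do_IRQ later knows whether anyone on the shared line actually serviced the interrupt.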

Related

How are software interrupts triggered in windows when the IRQL drops?

I know that for hardware interrupts, when KeLowerIrql is called by KeAcquireInterruptSpinLock, the HAL adjusts the interrupt mask in the LAPIC, which will allow queued interrupts (probably in the IRR) to be serviced automatically. But with software interrupts, for instance ntdll.dll sysenter calls to the SSDT NtXxx system services, how are they 'postponed' and triggered when the IRQL drops to passive level? The same goes for the DPC dispatcher software interrupt (if the DPC is for the current CPU and of high priority): how is that triggered when IRQL < dispatch IRQL? Do the functions called via software interrupt (NtXxx) in the SSDT all loop on a condition, i.e.
while (irql != passive)
Exactly the same question for lazy IRQL:
Because accessing a PIC is a relatively slow operation, HALs that require accessing the I/O bus to change IRQLs, such as for PIC and 32-bit Advanced Configuration and Power Interface (ACPI) systems, implement a performance optimization, called lazy IRQL, that avoids PIC accesses. When the IRQL is raised, the HAL notes the new IRQL internally instead of changing the interrupt mask. If a lower-priority interrupt subsequently occurs, the HAL sets the interrupt mask to the settings appropriate for the first interrupt and does not quiesce the lower-priority interrupt until the IRQL is lowered (thus keeping the interrupt pending). Thus, if no lower-priority interrupts occur while the IRQL is raised, the HAL doesn’t need to modify the PIC.
How does it keep this interrupt pending? Does it just loop on a condition until the higher priority ISR lowers the IRQL and when the thread is scheduled in, the condition will eventually be met? Is it just that simple?
Edit: I must be missing something here. Let's say an ISR at device IRQL requests a DPC using IoRequestDpc. If it is a high-priority DPC and the target is the current processor, it schedules an interrupt at DPC/dispatch level to drain the processor's DPC queue. This all happens in the ISR, which is at device IRQL (DIRQL), which means the software interrupt at DPC/dispatch IRQL will spin at KeAcquireInterruptSpinLock, I think, because the current IRQL is too high. But wouldn't it spin there forever? The actual routine that lowers the IRQL is called after the ISR returns, so it would stay stuck in the ISR at device IRQL, waiting on a software interrupt that requires IRQL < DPC/dispatch IRQL (2). Not only that, the dispatcher would not be able to dispatch the next thread, because the dispatch DPC runs at DPC/dispatch level, which is far lower. There is one solution I can think of:
1) The ISR returns the KDPC object to KiInterruptDispatch so that it knows what priority the DPC is, and then schedules it itself after lowering the IRQL using KeReleaseInterruptSpinLock. But KSERVICE_ROUTINE only returns an unrelated boolean value, so this is ruled out.
Does anyone know how this situation is avoided?
Edit 2: Perhaps it spawns a new thread that blocks waiting for IRQL < Dispatch IRQL and then returns from the ISR and drops the IRQL.
This is something that isn't really explained explicitly on any source and interestingly enough the second comment also asks the same question.
Firstly, DPC software interrupts aren't like regular SSDT software interrupts, which are not postponed, run at passive IRQL, and can be interrupted at any time. DPC software interrupts do not use int or syscall or anything like that; they are postponed and run at dispatch level.
After studying the ReactOS kernel and the WRK, I now know exactly what happens.
A driver, when it receives IRP_MN_START_DEVICE from the PnP manager, initialises an interrupt object using IoConnectInterrupt, using the data in the CM_RESOURCE_LIST it receives in the IRP. Of particular interest are the vector and affinity assigned by the PnP manager to the device (which is simple to do if the device exposes an MSI capability in its PCIe configuration space, as it doesn't have to worry about underlying IRQ routing). It passes the vector, a pointer to an ISR, context for the ISR, and the IRQL to IoConnectInterrupt, which calls KeInitializeInterrupt to initialise the interrupt object with those parameters, and then calls KeConnectInterrupt, which switches the affinity of the current thread to the target processor, locks the dispatcher database, and checks whether that IDT entry points to a BugCheck wrapper, KxUnexpectedInterrupt0[IdtIndex]. If it does, it raises the IRQL to 31 so that the following is an atomic operation, uses the HAL API to enable the vector that was mapped by the PnP manager on the LAPIC, and assigns it a TPR priority level corresponding to the IRQL. It then maps the vector to the handler address in the IDT entry for the vector. To do this it passes the address &Interrupt->DispatchCode[0] into the IDT mapping routine KeSetIdtHandlerAddress. This appears to be a template which is the same for all interrupt objects and which, according to the WRK, is KiInterruptTemplate. Sure enough, checking the ReactOS kernel, we see in KeInitializeInterrupt (which is called by IoConnectInterrupt) the code:
RtlCopyMemory(Interrupt->DispatchCode,
              KiInterruptDispatchTemplate,
              sizeof(Interrupt->DispatchCode));
KiInterruptDispatchTemplate appears to be blank for now, because ReactOS's amd64 port is in early development. On Windows it will be implemented, as KiInterruptTemplate.
It then lowers the IRQL back to the old IRQL. If the IDT entry did not point to a BugCheck ISR, it initialises a chained interrupt, because there was already an address at the IDT entry. It uses CONTAINING_RECORD to acquire the interrupt object from its member, the address of the handler (DispatchCode[0]), and connects the new interrupt object to the one already present, initialising the already-referenced interrupt object's LIST_ENTRY as the head of the list and marking it as a chained interrupt by setting the DispatchAddress member to the address of KiChainedDispatch. It then drops the dispatcher database spinlock, switches the affinity back, and returns the interrupt object.
The driver then sets up a DPC -- with the DeferredRoutine as a member -- for the Device Object using IoInitializeDpcRequest.
FORCEINLINE VOID IoInitializeDpcRequest(_In_ PDEVICE_OBJECT DeviceObject, _In_ PIO_DPC_ROUTINE DpcRoutine)
{
    KeInitializeDpc(&DeviceObject->Dpc,
                    (PKDEFERRED_ROUTINE)DpcRoutine,
                    DeviceObject);
}
KeInitializeDpc calls KiInitializeDpc, which is hard-coded to set the priority to medium, meaning KeInsertQueueDpc will place it in the middle of the DPC queue. KeSetImportanceDpc and KeSetTargetProcessorDpc can be used after the call to set the generated DPC's priority and target processor respectively. It copies a DPC object into the member of the device object, and if there is already a DPC object there, it queues the new one to the DPC already present.
When the interrupt happens, the KiInterruptTemplate template of the interrupt object is the address in the IDT that gets called, and it then calls the real interrupt dispatcher: the DispatchAddress member, which will be KiInterruptDispatch for a normal interrupt or KiChainedDispatch for a chained interrupt. It passes the interrupt object to KiInterruptDispatch. (It can do this because, as we saw earlier, RtlCopyMemory copied KiInterruptTemplate into the interrupt object; this means the template can use RIP-relative addressing in an asm block to acquire the address of the interrupt object it belongs to. It could also attempt something with CONTAINING_RECORD, but intsup.asm contains the following code to do it: lea rbp, KiInterruptTemplate - InDispatchCode ; get interrupt object address / jmp qword ptr InDispatchAddress[rbp] ; finish in common code.) KiInterruptDispatch will then acquire the interrupt's spinlock, probably using KeAcquireInterruptSpinLock. The ISR (with its ServiceContext) calls IoRequestDpc with the address of the device object that was created for the device and ISR, along with interrupt-specific context and an optional IRP (which I'm guessing it gets from the head at DeviceObject->Irp if the routine is meant to handle an IRP). I expected IoRequestDpc to be a single-line wrapper around KeInsertQueueDpc, passing the Dpc member of the device object, and that's exactly what it is: KeInsertQueueDpc(&DeviceObject->Dpc, Irp, Context);. Firstly, KeInsertQueueDpc raises the IRQL from the device IRQL of the ISR to 31, which prevents all preemption. The WRK contains the following on line 263 of dpcobj.c:
#if !defined(NT_UP)
    if (Dpc->Number >= MAXIMUM_PROCESSORS) {
        Number = Dpc->Number - MAXIMUM_PROCESSORS;
        TargetPrcb = KiProcessorBlock[Number];
    } else {
        Number = CurrentPrcb->Number;
        TargetPrcb = CurrentPrcb;
    }
This suggests that the Dpc->Number member must be set by KeSetTargetProcessorDpc as target core number + MAXIMUM_PROCESSORS. This seemed bizarre, and sure enough, when I went and looked at ReactOS's KeSetTargetProcessorDpc, that is exactly what it does! KiProcessorBlock appears to be a kernel structure for fast access to the KPRCB structures of each core.
It then gets the core's normal DPC queue spinlock using DpcData = KiSelectDpcData(TargetPrcb, Dpc), which returns &Prcb->DpcData[DPC_NORMAL], as the type of the DPC passed to it is normal, not threaded. It then acquires the spinlock for the queue; this appears to be an empty function body on ReactOS, and I think it's because of this:
/* On UP builds, spinlocks don't exist at IRQL >= DISPATCH */
And that makes sense, because ReactOS only supports one core, meaning there is no thread on another core that could access the DPC queue (another core might have a DPC targeted at this core's queue). There is only one DPC queue. On a multicore system it would have to acquire the spinlock, so these look to be placeholders for when multicore functionality is implemented. If it failed to acquire the spinlock for the DPC queue, it would either spin-wait at IRQL 31, or drop to the IRQL of the interrupt itself and spin-wait, allowing other interrupts to occur on the core but no other threads to run on it.
Note that windows would use KeAcquireSpinLockAtDpcLevel to acquire this spinlock, ReactOS does not. KeAcquireSpinLockAtDpcLevel does not touch the IRQL. Although, in the WRK it directly uses KiAcquireSpinLock which can be seen on line 275 of dpcobj.c which only acquires the spinlock and does nothing to the IRQL (KiAcquireSpinLock(&DpcData->DpcLock);).
After acquiring the spinlock, it first ensures that the DPC object isn't already on a queue (its DpcData member would be null; a cmpxchg initialises it with the DpcData returned from KiSelectDpcData(TargetPrcb, Dpc)). If it is already queued, it drops the spinlock and returns. Otherwise, it sets the DPC members to point to the interrupt-specific context that was passed, and then inserts it into the queue, either at the head (InsertHeadList(&DpcData->DpcListHead, &Dpc->DpcListEntry);) if (Dpc->Importance == HighImportance), or at the tail (InsertTailList(&DpcData->DpcListHead, &Dpc->DpcListEntry);). It then makes sure a DPC isn't already executing: if (!(Prcb->DpcRoutineActive) && !(Prcb->DpcInterruptRequested)). It then checks whether KiSelectDpcData returned the second KDPC_DATA structure, i.e. the DPC is of the threaded type (if (DpcData == &TargetPrcb->DpcData[DPC_THREADED])). If it is, and if ((TargetPrcb->DpcThreadActive == FALSE) && (TargetPrcb->DpcThreadRequested == FALSE)), then it does a locked xchg to set TargetPrcb->DpcSetEventRequest to true, sets TargetPrcb->DpcThreadRequested and TargetPrcb->QuantumEnd to true, and sets RequestInterrupt to true if the target PRCB is the current PRCB; otherwise it only sets it to true if the target core is not idle.
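The head-vs-tail placement can be sketched with a minimal LIST_ENTRY-style circular list. All types and helpers below are simplified stand-ins, not the real NT definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal LIST_ENTRY-style circular doubly linked list, as used by the
 * DPC queue. Names mirror the NT style but are illustrative only. */
typedef struct list_entry {
    struct list_entry *Flink, *Blink;
} LIST_ENTRY;

enum importance { LowImportance, MediumImportance, HighImportance };

typedef struct kdpc {
    LIST_ENTRY DpcListEntry;
    enum importance Importance;
} KDPC;

static void InitializeListHead(LIST_ENTRY *h) { h->Flink = h->Blink = h; }

static void InsertHeadList(LIST_ENTRY *h, LIST_ENTRY *e)
{
    e->Flink = h->Flink; e->Blink = h;
    h->Flink->Blink = e; h->Flink = e;
}

static void InsertTailList(LIST_ENTRY *h, LIST_ENTRY *e)
{
    e->Blink = h->Blink; e->Flink = h;
    h->Blink->Flink = e; h->Blink = e;
}

/* High-importance DPCs jump to the head; everything else goes to the tail. */
static void queue_dpc(LIST_ENTRY *head, KDPC *dpc)
{
    if (dpc->Importance == HighImportance)
        InsertHeadList(head, &dpc->DpcListEntry);
    else
        InsertTailList(head, &dpc->DpcListEntry);
}
```

This is why a medium-priority DPC ends up "in the middle": it is appended behind whatever is queued, while high-importance DPCs overtake it.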
Now comes the crux of the original question. The WRK now contains the following code:
#if !defined(NT_UP)
    if (CurrentPrcb != TargetPrcb) {
        if (((Dpc->Importance == HighImportance) ||
             (DpcData->DpcQueueDepth >= TargetPrcb->MaximumDpcQueueDepth))) {
            if (((KiIdleSummary & AFFINITY_MASK(Number)) == 0) ||
                (KeIsIdleHaltSet(TargetPrcb, Number) != FALSE)) {
                TargetPrcb->DpcInterruptRequested = TRUE;
                RequestInterrupt = TRUE;
            }
        }
    } else {
        if ((Dpc->Importance != LowImportance) ||
            (DpcData->DpcQueueDepth >= TargetPrcb->MaximumDpcQueueDepth) ||
            (TargetPrcb->DpcRequestRate < TargetPrcb->MinimumDpcRate)) {
            TargetPrcb->DpcInterruptRequested = TRUE;
            RequestInterrupt = TRUE;
        }
    }
#endif
In essence, on a multiprocessor system: if the target core acquired from the DPC object is not the current core of the thread, then, if the DPC is of high importance or the queue exceeds the maximum queue depth, and the logical AND of the target affinity and the idle cores is 0 (i.e. the target core is not idle; KeIsIdleHaltSet appears to check exactly the same thing, the Sleeping flag in the target PRCB), it sets a DpcInterruptRequested flag in the PRCB of the target core. If the target of the DPC is the current core, then, if the DPC is not of low importance (note: this allows medium!), or if the DPC queue depth exceeds the maximum queue depth and the request rate of DPCs on the core hasn't exceeded the minimum, it sets a flag in the PRCB of the current core to indicate there is a DPC.
It now releases the DPC queue spinlock, KiReleaseSpinLock(&DpcData->DpcLock); (#if !defined(NT_UP), of course), which doesn't alter the IRQL. It then checks whether an interrupt was requested by the procedure (if (RequestInterrupt == TRUE)). If it is a uniprocessor system (#if defined(NT_UP)) it simply calls KiRequestSoftwareInterrupt(DISPATCH_LEVEL);, but on a multicore system it needs to check the target PRCB to see whether it needs to send an IPI:
if (TargetPrcb != CurrentPrcb) {
    KiSendSoftwareInterrupt(AFFINITY_MASK(Number), DISPATCH_LEVEL);
} else {
    KiRequestSoftwareInterrupt(DISPATCH_LEVEL);
}
What that does speaks for itself: if the current PRCB is not the target PRCB of the DPC, it sends an IPI of DISPATCH_LEVEL priority to the target processor using KiSendSoftwareInterrupt; otherwise, it uses KiRequestSoftwareInterrupt. There is no documentation at all, but my guess is that this is a self-IPI, wrapping a HAL function that programs the ICR to send an IPI to the core itself at dispatch-level priority (my reasoning being that ReactOS at this stage calls HalRequestSoftwareInterrupt, which shows an unimplemented PIC write). So it's not a software interrupt in the INT sense; put simply, it is actually a hardware interrupt. It then lowers the IRQL back from 31 to the previous IRQL (which was the ISR IRQL). It then returns to the ISR, which returns to KiInterruptDispatch; KiInterruptDispatch then releases the ISR spinlock using KeReleaseInterruptSpinLock, which reduces the IRQL to what it was before the interrupt, and then pops the trap frame. I would have thought it would first pop the trap frame and then program the LAPIC TPR, so that the register-restore process is atomic, but I suppose it doesn't really matter.
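The overall "latch now, deliver when the IRQL drops" behaviour can be modelled in a few lines of user-space C. The names, levels, and single pending slot are illustrative only, not the real kernel interfaces:

```c
#include <assert.h>

/* Toy model of a requested DISPATCH_LEVEL software interrupt: the request
 * only latches a pending level; delivery happens when the IRQL is lowered
 * below it. All names here are illustrative stand-ins. */
enum { PASSIVE_LEVEL = 0, APC_LEVEL = 1, DISPATCH_LEVEL = 2 };

static int current_irql  = DISPATCH_LEVEL;
static int pending_level = -1;  /* latched software-interrupt level, -1 = none */
static int dispatches_run;      /* counts simulated KiDispatchInterrupt calls */

static void request_software_interrupt(int level)
{
    if (level > pending_level)
        pending_level = level;  /* latch it; masked while IRQL >= level */
}

static void lower_irql(int new_irql)
{
    current_irql = new_irql;
    /* On the way down, deliver a latched interrupt that is now unmasked. */
    if (pending_level > new_irql) {
        pending_level = -1;
        dispatches_run++;       /* stands in for running KiDispatchInterrupt() */
    }
}
```

This is the essence of why the ISR can safely request a dispatch-level interrupt while running at device IRQL: nothing spins, the request simply sits latched until KeReleaseInterruptSpinLock lowers the IRQL.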
ReactOS has the following (the WRK doesn't have KeReleaseSpinLock or the IRQL-lowering procedures documented, so this is the best we have):
VOID NTAPI KeReleaseSpinLock(PKSPIN_LOCK SpinLock, KIRQL OldIrql)
{
    /* Release the lock and lower IRQL back */
    KxReleaseSpinLock(SpinLock);
    KeLowerIrql(OldIrql);
}

VOID FASTCALL KfReleaseSpinLock(PKSPIN_LOCK SpinLock, KIRQL OldIrql)
{
    /* Simply lower IRQL back */
    KeLowerIrql(OldIrql);
}
KeLowerIrql is a wrapper for the HAL function KfLowerIrql, the function contains KfLowerIrql(OldIrql); and that's it.
VOID FASTCALL KfLowerIrql(KIRQL NewIrql)
{
    DPRINT("KfLowerIrql(NewIrql %d)\n", NewIrql);

    if (NewIrql > KeGetPcr()->Irql)
    {
        DbgPrint("(%s:%d) NewIrql %x CurrentIrql %x\n",
                 __FILE__, __LINE__, NewIrql, KeGetPcr()->Irql);
        KeBugCheck(IRQL_NOT_LESS_OR_EQUAL);
        for (;;);
    }

    HalpLowerIrql(NewIrql);
}
This function basically prevents the new IRQL from being higher than the current IRQL, which makes sense because the function is supposed to lower the IRQL. If everything is OK, the function calls HalpLowerIrql(NewIrql);. This is a skeleton of a multiprocessor AMD64 implementation: it does not actually implement the APIC register writes (or MSRs for x2APIC); they are empty functions in ReactOS's in-development multiprocessor AMD64 implementation. On Windows they won't be, and they'll actually program the LAPIC TPR so that the queued software interrupt can now occur.
VOID HalpLowerIrql(KIRQL NewIrql, BOOLEAN FromHalEndSystemInterrupt)
{
    ULONG Flags;
    UCHAR DpcRequested;

    if (NewIrql >= DISPATCH_LEVEL)
    {
        KeSetCurrentIrql(NewIrql);
        APICWrite(APIC_TPR, IRQL2TPR(NewIrql) & APIC_TPR_PRI);
        return;
    }

    Flags = __readeflags();
    if (KeGetCurrentIrql() > APC_LEVEL)
    {
        KeSetCurrentIrql(DISPATCH_LEVEL);
        APICWrite(APIC_TPR, IRQL2TPR(DISPATCH_LEVEL) & APIC_TPR_PRI);
        DpcRequested = __readfsbyte(FIELD_OFFSET(KIPCR, HalReserved[HAL_DPC_REQUEST]));
        if (FromHalEndSystemInterrupt || DpcRequested)
        {
            __writefsbyte(FIELD_OFFSET(KIPCR, HalReserved[HAL_DPC_REQUEST]), 0);
            _enable();
            KiDispatchInterrupt();
            if (!(Flags & EFLAGS_INTERRUPT_MASK))
            {
                _disable();
            }
        }
        KeSetCurrentIrql(APC_LEVEL);
    }

    if (NewIrql == APC_LEVEL)
    {
        return;
    }

    if (KeGetCurrentThread() != NULL &&
        KeGetCurrentThread()->ApcState.KernelApcPending)
    {
        _enable();
        KiDeliverApc(KernelMode, NULL, NULL);
        if (!(Flags & EFLAGS_INTERRUPT_MASK))
        {
            _disable();
        }
    }

    KeSetCurrentIrql(PASSIVE_LEVEL);
}
Firstly, it checks whether the new IRQL is at or above dispatch level; if so, it sets it, writes the LAPIC TPR register, and returns. If not, it checks whether the current IRQL is at dispatch level or higher (> APC_LEVEL); by definition, the new IRQL is going to be less than dispatch level. We can see that in this event it sets the IRQL to DISPATCH_LEVEL, rather than letting it drop below, and writes that to the LAPIC TPR register. It then checks HalReserved[HAL_DPC_REQUEST], which appears to be what ReactOS uses instead of the DpcInterruptRequested flag we saw previously, so just substitute it with that. It then sets it to 0 (note the PCR begins at the start of the segment descriptor pointed to by the FS segment in kernel mode). It then enables interrupts and calls KiDispatchInterrupt, and afterwards, if the IF flag in EFLAGS was clear before, it disables interrupts again. It then also checks whether a kernel APC is pending (which is beyond the scope of this explanation) before finally setting the IRQL to passive level.
VOID NTAPI KiDispatchInterrupt(VOID)
{
    PKIPCR Pcr = (PKIPCR)KeGetPcr();
    PKPRCB Prcb = &Pcr->Prcb;
    PKTHREAD NewThread, OldThread;

    /* Disable interrupts */
    _disable();

    /* Check for pending timers, pending DPCs, or pending ready threads */
    if ((Prcb->DpcData[0].DpcQueueDepth) ||
        (Prcb->TimerRequest) ||
        (Prcb->DeferredReadyListHead.Next))
    {
        /* Retire DPCs while under the DPC stack */
        //KiRetireDpcListInDpcStack(Prcb, Prcb->DpcStack);
        // FIXME!!! //
        KiRetireDpcList(Prcb);
    }

    /* Re-enable interrupts */
    _enable();

    /* Check for quantum end */
    if (Prcb->QuantumEnd)
    {
        /* Handle quantum end */
        Prcb->QuantumEnd = FALSE;
        KiQuantumEnd();
    }
    else if (Prcb->NextThread)
    {
        /* Capture current thread data */
        OldThread = Prcb->CurrentThread;
        NewThread = Prcb->NextThread;

        /* Set new thread data */
        Prcb->NextThread = NULL;
        Prcb->CurrentThread = NewThread;

        /* The thread is now running */
        NewThread->State = Running;
        OldThread->WaitReason = WrDispatchInt;

        /* Make the old thread ready */
        KxQueueReadyThread(OldThread, Prcb);

        /* Swap to the new thread */
        KiSwapContext(APC_LEVEL, OldThread);
    }
}
Firstly, it disables interrupts. _disable is just a wrapper around an asm block that clears the IF flag and has memory and cc in the clobber list (to prevent compiler reordering). This looks like ARM syntax, though:
{
    __asm__ __volatile__
    (
        "cpsid i # __cli" : : : "memory", "cc"
    );
}
This ensures that it can drain the DPC queue as an uninterrupted procedure; with interrupts disabled, it cannot be interrupted by a clock interrupt and rescheduled. This prevents, for instance, two schedulers running at the same time: if a thread yields with Sleep(), it ends up calling KeRaiseIrqlToSynchLevel, which is analogous to disabling interrupts. This prevents a timer interrupt from interrupting it and scheduling another thread switch on top of the currently executing thread-switch procedure; it ensures that scheduling is atomic.
It checks whether there are DPCs on the normal queue of the current core, whether there is a timer expiry, or whether there are deferred ready threads, and then calls KiRetireDpcList, which basically contains a "while queue depth != 0" loop. The loop first checks whether there is a timer expiry request (which I won't go into now); if not, it acquires the DPC queue spinlock, takes a DPC off the queue and parses its members into arguments (interrupts still disabled), decreases the queue depth, drops the spinlock, enables interrupts, and calls the DeferredRoutine. When the DeferredRoutine returns, it disables interrupts again, and if there are more DPCs in the queue it reacquires the spinlock (the spinlock plus disabled interrupts ensure that the removal of a DPC from the queue is atomic, so another interrupt, and hence another queue drain, does not work on the same DPC: it will already have been removed from the queue). Since the DPC queue spinlock is not implemented yet on ReactOS, we can postulate what might happen on Windows: if it fails to acquire the spinlock then, given that it's a spinlock and we are still at DISPATCH_LEVEL with interrupts disabled, it would spin until the thread on the other core calls KeReleaseSpinLockFromDpcLevel(&DpcData->DpcLock);. That is not much of a holdup, as each thread holds the spinlock for perhaps a hundred uops, so we can afford to have interrupts disabled at DISPATCH_LEVEL.
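The drain pattern just described (pop under the lock with interrupts disabled, run the routine with them enabled) can be sketched in user-space C. The locking and interrupt state are modelled with plain flags; all names are illustrative stand-ins:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the KiRetireDpcList drain loop. */
typedef void (*deferred_fn)(void *ctx);

struct dpc {
    deferred_fn routine;
    void *ctx;
    struct dpc *next;
};

struct dpc_queue {
    struct dpc *head;
    int depth;
    int lock_held;      /* stands in for the per-queue spinlock */
    int irqs_enabled;   /* stands in for the IF flag */
};

static void retire_dpc_list(struct dpc_queue *q)
{
    q->irqs_enabled = 0;            /* _disable() */
    while (q->depth != 0) {
        q->lock_held = 1;           /* acquire the queue spinlock */
        struct dpc *d = q->head;    /* unlink one DPC atomically */
        q->head = d->next;
        q->depth--;
        q->lock_held = 0;           /* drop the lock before running it */
        q->irqs_enabled = 1;        /* _enable() */
        d->routine(d->ctx);         /* call the DeferredRoutine */
        q->irqs_enabled = 0;        /* disable again for the next pop */
    }
    q->irqs_enabled = 1;
}

/* Sample deferred routine for demonstration. */
static int runs;
static void count_run(void *ctx) { (void)ctx; runs++; }
```

Note that the deferred routines themselves always run with "interrupts enabled"; only the queue manipulation is protected.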
Note that the drain procedure only ever drains the queue of the current core. When the DPC queue is empty, it reenables interrupts and checks to see if there are any deferred ready threads and makes them all ready. It then returns down the callchain to KiInterruptTemplate and then the ISR officially ends.
So, as an overview: in KeInsertQueueDpc, if the DPC to queue targets another core, and it is of high priority or the queue depth exceeds the maximum defined in the PRCB, then it sets the DpcRequested flag in that core's PRCB and sends an IPI to the core, which most likely runs KiDispatchInterrupt in some way (the ISR could just be the IRQL-lowering procedure, which indeed calls KiDispatchInterrupt), and that drains the DPC queue on that core. The actual wrapper that calls KiDispatchInterrupt may or may not clear the DpcRequested flag in the PRCB as HalpLowerIrql does, I don't know; it may indeed be HalpLowerIrql as I suggested. After KeInsertQueueDpc, when it lowers the IRQL, nothing happens, because the DpcRequested flag is on the other core, not the current one. If the DPC targets the current core, then, if it is of high or medium priority, or the queue depth has exceeded the maximum and the DPC rate is less than the minimum defined in the PRCB, it sets the DpcRequested flag in the PRCB and requests a self-IPI, which will call the same generic wrapper that is used by the scheduler as well, so probably something like HalpLowerIrql. After KeInsertQueueDpc, it lowers the IRQL with HalpLowerIrql, sees DpcRequested, and drains the queue of the current core before lowering the IRQL.
Do you see the problem with this, though? The WRK shows a "software" interrupt being requested (whose ISR probably calls KiDispatchInterrupt, as it is a multi-purpose function and only one function, KiRequestSoftwareInterrupt(DISPATCH_LEVEL), is ever used in all scenarios), but ReactOS shows KiDispatchInterrupt also being called when the IRQL drops. You'd expect that when KiInterruptDispatch drops the ISR spinlock, the function doing so would just check for deferred ready threads or a timer expiry request and then simply drop the IRQL, because the software interrupt to drain the queue will happen as soon as the LAPIC TPR is programmed; but ReactOS actually checks for items on the queue (using the flag in the PRCB) and initiates the queue draining in the procedure that lowers the IRQL. There is no WRK source code for the spinlock release, but let's assume it doesn't do what ReactOS does and lets the "software" interrupt handle it; perhaps it leaves that whole DPC queue check out of its equivalent of HalpLowerIrql. But wait a second: what is Prcb->DpcInterruptRequested for, then, if it's not used for initiating the queue drain as on ReactOS? Perhaps it is merely a control variable so that it doesn't queue two software interrupts. We also note that ReactOS requests a "software" interrupt at this stage as well (to ARM's Vectored Interrupt Controller), which is extremely odd. So maybe not. This blatantly suggests that it gets called twice: it drains the queue, and then the "software" interrupt comes in immediately afterwards when the IRQL drops (and most likely also calls KiRetireDpcList at some stage), both on ReactOS and in the WRK, and does the same thing. I wonder what anyone makes of that. I mean, why both self-IPI and then drain the queue anyway? One of these actions is redundant.
As for lazy IRQL: I see no evidence of it in the WRK or ReactOS, but the place it would be implemented is KiInterruptDispatch. It would be possible to get the current IRQL using KeGetCurrentIrql, compare it to the IRQL of the interrupt object, and then program the TPR to correspond to the current IRQL. It would then either quiesce the interrupt and queue another for that vector using a self-IPI, or simply switch trap frames.
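A toy model of the lazy-IRQL optimisation quoted earlier might look like the following. This is speculative user-space C under the assumptions in the quote (raise records the level only; the slow PIC write happens just once, when a lower-priority interrupt actually arrives), not a real HAL:

```c
#include <assert.h>

/* Illustrative model of lazy IRQL: avoid touching the PIC on raise,
 * and only pay for a mask write if a lower-priority IRQ shows up. */
static int irql;              /* software-visible IRQL */
static int pic_writes;        /* counts expensive PIC accesses */
static int pending_irq = -1;  /* one pended lower-priority interrupt */

static void raise_irql_lazy(int new_irql)
{
    irql = new_irql;          /* note it internally; no PIC access */
}

static void on_interrupt(int level)
{
    if (level <= irql) {
        /* Lower priority: now pay for the PIC write and pend the IRQ. */
        pic_writes++;
        pending_irq = level;
    }
    /* Higher-priority interrupts would be serviced immediately (not modelled). */
}

/* Lowering the IRQL delivers a pended interrupt that is now unmasked.
 * Returns the delivered level, or -1 if nothing was pending. */
static int lower_irql_lazy(int new_irql)
{
    irql = new_irql;
    if (pending_irq >= 0 && pending_irq > new_irql) {
        int served = pending_irq;
        pending_irq = -1;
        return served;
    }
    return -1;
}
```

The point of the optimisation shows up in the common case: a raise/lower pair with no intervening lower-priority interrupt never touches the PIC at all.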

Interrupt handling in Device Driver

I have written a simple character driver, requested an IRQ on a GPIO pin, and written a handler for it.
err = request_irq(irq, irq_handler, IRQF_SHARED | IRQF_TRIGGER_RISING,
                  INTERRUPT_DEVICE_NAME, raspi_gpio_devp);
static irqreturn_t irq_handler(int irq, void *arg);
Now, from theory, I know that upon an interrupt, the interrupt controller will tell the processor to call do_IRQ(), which will check the IDT and call my interrupt handler for this line.
How does the kernel know that the interrupt handler was for this particular device file?
Also, I know that interrupt handlers do not run in any process context. But say I am accessing a variable declared outside the scope of the handler, a static global flag = 0, and in the handler I set flag = 1, indicating that an interrupt has occurred. That variable is in process context. So I am confused how this handler, which is not in any process context, can modify a variable in process context.
Thanks
The kernel does not know that this particular interrupt is for a particular device.
The only thing it knows is that it must call irq_handler with raspi_gpio_devp as a parameter (like this: irq_handler(irq, raspi_gpio_devp)).
If your irq line is shared, you should check if your device generated an IRQ or not. Code:
static irqreturn_t irq_handler(int irq, void *dev_id)
{
    struct raspi_gpio_dev *raspi_gpio_devp = (struct raspi_gpio_dev *)dev_id;

    if (!my_gpio_irq_occured(raspi_gpio_devp))
        return IRQ_NONE;
    /* do stuff here */
    return IRQ_HANDLED;
}
The interrupt handler runs in interrupt context. But you can access static variables declared outside the scope of the interrupt.
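As a user-space sketch of the questioner's flag pattern (names are illustrative): a file-scope variable lives in global kernel storage, not on any process's stack, so both the handler and process-context code can touch it. volatile keeps the compiler from caching it in a register across the "interrupt":

```c
#include <assert.h>

/* File-scope flag: global storage, not tied to any process context.
 * volatile prevents the compiler from caching it across the "interrupt". */
static volatile int irq_seen = 0;

/* Stand-in for the interrupt handler: it needs no process context,
 * it just writes a global. */
static void fake_irq_handler(void)
{
    irq_seen = 1;
}

/* Stand-in for process-context code polling and consuming the flag. */
static int check_and_clear(void)
{
    int seen = irq_seen;
    irq_seen = 0;
    return seen;
}
```

In a real SMP kernel, a bare volatile flag is not enough; you would use atomics or proper synchronisation (or better, a wait queue as described below), but the storage-class point is the same.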
Usually, what an interrupt handler does is:
check interrupt status
retrieve information from the hardware and store it somewhere (a buffer/fifo for example)
wake_up() a kernel process waiting for that information
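The three steps above can be sketched as a user-space analogue. The fifo size and names are illustrative, and the wake-up is modelled with a flag instead of a real wake_up() on a wait queue:

```c
#include <assert.h>

/* User-space analogue of a top half pushing into a fifo and waking a reader. */
#define FIFO_SIZE 8

static unsigned char fifo[FIFO_SIZE];
static int fifo_head, fifo_tail;   /* head = next write, tail = next read */
static int waiter_woken;           /* stands in for wake_up(&queue) */

static int fifo_push(unsigned char byte)
{
    int next = (fifo_head + 1) % FIFO_SIZE;
    if (next == fifo_tail)
        return -1;                 /* full: a real driver might flag an overrun */
    fifo[fifo_head] = byte;
    fifo_head = next;
    return 0;
}

/* The handler's job: grab data from "hardware", queue it, wake the reader. */
static void top_half(unsigned char hw_data)
{
    fifo_push(hw_data);
    waiter_woken = 1;
}

/* Process context: consume what the handler queued. */
static int bottom_read(unsigned char *out)
{
    if (fifo_tail == fifo_head)
        return -1;                 /* empty */
    *out = fifo[fifo_tail];
    fifo_tail = (fifo_tail + 1) % FIFO_SIZE;
    return 0;
}
```

The real kernel equivalents would be a kfifo (or DMA ring) plus a wait queue, with appropriate locking between the handler and the reader.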
If you want to be really confident with the do and don't of interrupt handling, the best thing to read about is what a process is for the kernel.
An excellent book dealing with this is Linux Kernel Development by Robert Love.
The kernel doesn't know which device the interrupt pertains to. It is possible for a single interrupt to be shared among multiple devices. Previously this was quite common. It is becoming less so due to improved interrupt support in interrupt controllers and introduction of message-signaled interrupts. Your driver must determine whether the interrupt was from your device (i.e. whether your device needs "service").
You can provide context to your interrupt handler via the "void *arg" provided. This should never be process-specific context, because a process might exit leaving pointers dangling (i.e. referencing memory which has been freed and/or possibly reallocated for other purposes).
A global variable is not "in process context". It is in every context -- or no context if you prefer. When you hear "not in process context", that means a few things: (1) you cannot block/sleep (because what process would you be putting to sleep?), (2) you cannot make any references to user-space virtual addresses (because what would those references be pointing to?), (3) you cannot make references to "current task" (since there isn't one or it's unknown).
Typically, a driver's interrupt handler pushes or pulls data into "driver global" data areas from which/to which the process context end of the driver can transfer data.
This is to reply your question :-
how does the kernel know that the interrupt handler was for this particular >device file?
Each system-on-chip's documentation lists the interrupt numbers for the devices connected to its interrupt lines.
The same interrupt number has to be mentioned in the device tree entry used to instantiate the device driver.
The device driver's probe function usually parses the device tree, reads the IRQ number, and registers the handler using request_irq.
If multiple devices share a single IRQ number/line, the IRQ status registers of the different devices (if mapped under the same VM space) can be used inside the IRQ handler to differentiate.
Please read more in my blog

Replace HW interrupt in flat memory mode with DOS32/A

I have a question about how to replace a HW interrupt in flat memory mode.
About my application:
It is created by combining Watcom C and DOS32/A, and is written to run under DOS (not under an OS). With DOS32/A I can now access >1 MB of memory and allocate large buffers (running in flat memory mode!).
Current issue:
I want to write an ISR (interrupt service routine) for a PCI card, so I need to "replace" (hook) the HW interrupt.
Ex.: the PCI card's interrupt line = 0xE in DOS. That means this device issues interrupts via the 8259's IRQ 14.
But I do not know how to replace this interrupt in flat mode.
# resource I found...
- in watcom C's library, there is one sample using _dos_getvect, _dos_setvect, and _chain_intr to hook INT 0x1C...
I tested this code and found it works. But when I apply it to my case, INT 0x76 (IRQ 14 is INT 0x76: (14-8) + 0x70), nothing happens...
I checked that the HW interrupt is generated, but my own ISR is not invoked...
Am I missing something? Or are there other functions I can use to achieve my goal?
===============================================================
[20120809]
I tried using DPMI calls 0x204 and 0x205 and found that MyISR() is still not invoked. I describe what I did below; maybe you can give me some suggestions!
1) Use inline assembly to implement DPMI calls 0x204 and 0x205 and test OK...
Ex. Use DPMI 0x204 to show the interrupt vectors of the 16 IRQs; I get the following (selector:offset) results: 8:1540(INT8), 8:1544(INT9), ....., 8:1560(INT70), 8:1564(INT71), ..., 8:157C(INT77)
Ex. Use DPMI 0x205 to set the interrupt vector for IRQ14(INT76); it returned CF=0, indicating success
2) Create my own ISR MyISR() as follows:
volatile int tick=0; // global and volatile...
void MyISR(void)
{
tick = 5; // simple code to change the value of tick...
}
3) Set new interrupt vector by DPMI call 0x205:
selector = FP_SEG(MyISR); // selector = 0x838 here
offset = FP_OFF(MyISR); // offset = 0x30100963 here
sts = DPMI_SetIntVector(0x76, selector, offset, &out_ax);
Then sts = 0 (CF=0), indicating success!
One strange thing here: my app runs in the flat memory model, so I thought the selector for MyISR() should be 0... But if selector = 0 for DPMI call 0x205 then I get CF=1 and AX = 0x8022, indicating "invalid selector"!
4) Let HW interrupt be generated and the evidences are:
PCI device config register 0x5 bit2(Interrupt Disabled) = 0
PCI device config register 0x6 bit3(Interrupt status) = 1
PCI device config register 0x3C/0x3D (Interrupt line) = 0xE/0x2
In DOS the interrupt mode is PIC mode(8259 mode) and Pin-based(MSIE=0)
5) Display the value of tick: it is still "0"...
Thus I think MyISR() is not invoked correctly...
Try using DPMI Function 0204h and 0205h instead of '_dos_getvect' and '_dos_setvect', respectively.
The runtime environment of your program is DOS32A or a DPMI server/host, so use the API they provide instead of the DOS int 21h facilities. But DOS32A does intercept int 21h interrupts, so your code should work fine as far as real mode is concerned.
Actually, what you did is install only a real-mode interrupt handler for IRQ14, using the '_dos_getvect' and '_dos_setvect' functions.
By using the DPMI functions instead, you install a protected-mode interrupt handler for IRQ14, and DOS32a will automatically pass IRQ14 up to this protected-mode handler.
Recall: a DOS extender/DPMI server can be in protected mode or real mode when an IRQ is asserted.
This is because your application uses some DOS or BIOS APIs, so the extender needs to switch to real mode to execute them and then return to protected mode to transfer control back to your protected-mode application.
DOS32a does this by allocating a real-mode callback (at least for hardware interrupts) which calls your protected-mode handler if IRQ14 is asserted while the extender is in real mode.
If the extender is in protected mode while IRQ14 is asserted, it will automatically transfer control to your IRQ14 handler.
But if you didn't install a protected-mode handler for your IRQ, DOS32a will not allocate any real-mode callback, and your real-mode IRQ handler may not get control.
(Though it should receive control, AFAIK.)
Anyway, give the above two functions a try, and do chain to the previous INT 76h interrupt handler as Sean said.
In short:
In case of DOS32a, you need not use '_dos_getvect' and '_dos_setvect' functions. Instead use the DPMI functions 0204h and 0205h for installing your protected mode IRQ handler.
Some advice: in your interrupt handler, the first step should be to check whether your device actually generated the interrupt or whether it was some other device sharing the IRQ (IRQ14 in your case). You can do this by checking an 'interrupt pending' bit in your device: if it is set, service your device and chain to the next handler; if it is not set, simply chain to the next handler.
EDITED:
Use the latest version of DOS32a, instead of one that comes with OW.
Update on 2012-08-14:
Yes, you can use the FP_SEG and FP_OFF macros for obtaining the selector and offset respectively, just as you would use these macros in real mode to get the segment and offset.
You can also use MK_FP macro to create far pointers from selector and offset. eg.
MK_FP(selector, offset).
You should declare your interrupt handler with the '__interrupt' keyword when writing handlers in C.
Here is a snippet:
#include <i86.h> /* for FP_OFF, FP_SEG, and MK_FP in OW */
/* C Prototype for your IRQ handler */
void __interrupt __far irqHandler(void);
.
.
.
irq_selector = (unsigned short)FP_SEG( &irqHandler );
irq_offset = (unsigned long)FP_OFF( &irqHandler );
__dpmi_SetVect( intNum, irq_selector, irq_offset );
.
.
.
or, try this:
extern void sendEOItoMaster(void);
#pragma aux sendEOItoMaster = \
"mov al, 0x20" \
"out 0x20, al" \
modify [eax] ;
extern void sendEOItoSlave(void);
#pragma aux sendEOItoSlave = \
"mov al, 0x20" \
"out 0xA0, al" \
modify [eax] ;
unsigned int old76_selector, new76_selector;
unsigned long old76_offset, new76_offset;
volatile int chain = 1; /* Chain to the old handler */
volatile int tick=0; // global and volatile...
void (__interrupt __far *old76Handler)(void) = NULL; // function pointer declaration
void __interrupt __far new76Handler(void) {
tick = 5; // simple code to change the value of tick...
.
.
.
if( chain ){
// disable irqs if enabled above.
_chain_intr( old76Handler ); // 'jumping' to the old handler
// ( *old76Handler )(); // 'calling' the old handler
}else{
sendEOItoMaster();
sendEOItoSlave();
}
}
__dpmi_GetVect( 0x76, &old76_selector, &old76_offset );
old76Handler = ( void (__interrupt __far *)(void) ) MK_FP( old76_selector, old76_offset );
new76_selector = (unsigned int)FP_SEG( &new76Handler );
new76_offset = (unsigned long)FP_OFF( &new76Handler );
__dpmi_SetVect( 0x76, new76_selector, new76_offset );
.
.
NOTE:
You should first double check that the IRQ# you are hooking is really assigned/mapped to the interrupt pin of your concerned PCI device. IOWs, first read 'Interrupt Line register' (NOT Interrupt Pin register) from PCI configuration space, and hook only that irq#. The valid values for this register, in your case are: 0x00 through 0x0F inclusive, with 0x00 means IRQ0 and 0x01 means IRQ1 and so on.
POST/BIOS code writes a value into the 'Interrupt Line register' while booting, and you MUST NOT modify this register (unless, of course, you are dealing with interrupt routing issues, which an OS writer would handle).
You should also get and save the selector and offset of the old handler using DPMI call 0204h if you are chaining to the old handler. If not, don't forget to send an EOI (end-of-interrupt) to BOTH master and slave PICs if you hooked an IRQ belonging to the slave PIC (i.e. INT 70h through 77h), and ONLY to the master PIC if you hooked an IRQ belonging to the master PIC.
In flat model, the BASE address is 0 and Limit is 0xFFFFF, with G bit(ie Granularity bit) = 1.
The base and limit(along with attribute bits(e.g G bit) of a segment) reside in the descriptor corresponding to a particular segment. The descriptor itself, sits in the descriptor table.
Descriptor tables are an array with each entry being 8bytes.
The selector is merely a pointer(or an index) to the 8-byte descriptor entry, in the Descriptor table(either GDT or LDT). So a selector CAN'T be 0.
Note that lowest 3 bits of 16-bit selector have special meaning, and only the upper 13-bits are used to index a descriptor entry from a descriptor table.
GDT = Global Descriptor Table
LDT = Local Descriptor Table
A system can have only one GDT, but many LDTs.
Entry number 0 in the GDT is reserved and can't be used. AFAIK, DOS32A does not create any LDT for its applications; instead it simply allocates and initializes descriptor entries for the application in the GDT itself.
A selector MUST not be 0: the x86 architecture regards the 0 selector as invalid when you try to access memory through it. You can successfully place 0 in any segment register; it is only when you try to access (read/write/execute) that segment that the CPU generates an exception.
In case of interrupt handlers, the base address need not be 0, even in case of flat mode.
The DPMI environment must have valid reasons for doing this so.
After all, you still need to tackle segmentation at some level in x86 architecture.
PCI device config register 0x5 bit2(Interrupt Disabled) = 0
PCI device config register 0x6 bit3(Interrupt status) = 1
I think you mean the bus master command and status registers, respectively. They actually reside in either I/O space or memory space, but NOT in PCI configuration space.
So you can read/write them directly via IN/OUT or MOV, instructions.
For reading/writing PCI configuration registers you must use configuration read/write mechanisms or the PCI BIOS routines.
NOTE:
Many PCI disk controllers have a bit called 'interrupt enable/disable'. The register that contains this bit is usually in the PCI configuration space and can be found in the datasheet.
Actually, this setting is for "forwarding" the interrupt generated by the device attached to the PCI controller, to the PCI bus.
If interrupts are disabled via this bit, then even if your device (attached to the PCI controller) generates an interrupt, the interrupt will NOT be forwarded to the PCI bus (and hence the CPU will never know an interrupt occurred). However, the interrupt bit in the PCI controller (a different bit from 'interrupt enable/disable') is still set to indicate that the attached device (e.g. a hard disk) generated an interrupt, so a program can read this bit and take appropriate action. It is similar to polling, from a programming perspective.
This usually apply only for non-bus master transfers.
But, it seems that you are using bus master transfers(ie DMA), so it should not apply in your case.
But anyway, I would suggest you read the datasheet of the PCI controller carefully, especially looking for bits/registers related to interrupt handling.
EDITED:
Well, as far as application-level programming is concerned, you need not encounter/use far pointers, as your program will not access anything external to your code.
But this is not completely true once you go to system-level programming: you need to access memory-mapped device registers or external ROM, implement interrupt handlers, etc.
The story changes here. Creating a segment, i.e. allocating a descriptor and getting its associated selector, ensures that even if there is a bug in your code, it cannot change anything external to the particular segment from which the code is executing; if it tries to do so, the CPU generates a fault. So when accessing external devices (especially memory-mapped device registers), or ROM data such as the BIOS, it is a good idea to allocate a descriptor, set the base and limit according to the area you need to execute/read/write, and proceed. But you are not bound to do so.
Some external code, for example code residing in ROM, assumes that it will be invoked with a far call.
As I said earlier, in the x86 architecture, at some level (the farther down you go) you need to deal with segmentation, as there is no way to disable it completely.
But in the flat model, segmentation is present as an aid to the programmer when accessing things external to your program, as I said above. You need not use it if you don't want to.
When an interrupt handler is invoked, it doesn't know the base, limits, or segment attributes of the program that was interrupted; we say that except for CS and EIP, all registers are in an undefined state with respect to the interrupt handler. So it needs to be declared as a far function, to indicate that it resides somewhere external to the currently executing program.
It's been a while since I fiddled with interrupts, but the vector table holds the pointer that tells the processor where to go to process an interrupt. I can give you the process, but not code, as I only ever wrote 8086 code.
Pseudo code:
Initialize:
Get current vector - store value
Set vector to point to the entry point of your routine
next:
Process Interrupt:
Your code decides what to do with data
If it's your data:
process it, and return
If not:
jump to the stored vector that we got during initialize,
and let the chain of interrupts continue as they normally would
finally:
Program End:
check to see if interrupt still points to your code
if yes, set vector back to the saved value
if no, set beginning of your code to long jump to vector address you saved,
or set a flag that lets your program not process anything

spin_lock_irqsave vs spin_lock_irq

On an SMP machine we must use spin_lock_irqsave and not spin_lock_irq from interrupt context.
Why would we want to save the flags (which contain the IF)?
Is there another interrupt routine that could interrupt us?
spin_lock_irqsave is basically used to save the interrupt state before taking the spin lock, because the lock-acquire path disables interrupts and the unlock path re-enables them. The interrupt state is saved so that the unlock path can restore interrupts to their previous state instead of unconditionally re-enabling them.
Example:
1. Let's say interrupt x was disabled before the spin lock was acquired
2. spin_lock_irq will disable interrupt x and take the lock
3. spin_unlock_irq will enable interrupt x
So in the 3rd step above, after releasing the lock, interrupt x is enabled even though it was disabled before the lock was acquired.
So use spin_lock_irq only when you are sure interrupts are not already disabled; otherwise, always use spin_lock_irqsave.
If interrupts are already disabled before your code starts locking, when you call spin_unlock_irq you will forcibly re-enable interrupts in a potentially unwanted manner. If instead you also save the current interrupt enable state in flags through spin_lock_irqsave, attempting to re-enable interrupts with the same flags after releasing the lock, the function will just restore the previous state (thus not necessarily enabling interrupts).
Example with spin_lock_irqsave:
spinlock_t mLock = SPIN_LOCK_UNLOCKED;
unsigned long flags;
spin_lock_irqsave(&mLock, flags); // Save the state of interrupt enable in flags and then disable interrupts
// Critical section
spin_unlock_irqrestore(&mLock, flags); // Return to the previous state saved in flags
Example with spin_lock_irq (without irqsave):
spinlock_t mLock = SPIN_LOCK_UNLOCKED;
spin_lock_irq(&mLock); // Disables interrupts unconditionally; the previous state is not saved
// Critical section
spin_unlock_irq(&mLock); // Unconditionally re-enables interrupts, which is unwanted if they were already disabled
The need for spin_lock_irqsave besides spin_lock_irq is quite similar to the reason local_irq_save(flags) is needed besides local_irq_disable. Here is a good explanation of this requirement taken from Linux Kernel Development Second Edition by Robert Love.
The local_irq_disable() routine is dangerous if interrupts were
already disabled prior to its invocation. The corresponding call to
local_irq_enable() unconditionally enables interrupts, despite the
fact that they were off to begin with. Instead, a mechanism is needed
to restore interrupts to a previous state. This is a common concern
because a given code path in the kernel can be reached both with and
without interrupts enabled, depending on the call chain. For example,
imagine the previous code snippet is part of a larger function.
Imagine that this function is called by two other functions, one which
disables interrupts and one which does not. Because it is becoming
harder as the kernel grows in size and complexity to know all the code
paths leading up to a function, it is much safer to save the state of
the interrupt system before disabling it. Then, when you are ready to
reenable interrupts, you simply restore them to their original state:
unsigned long flags;
local_irq_save(flags); /* interrupts are now disabled */ /* ... */
local_irq_restore(flags); /* interrupts are restored to their previous
state */
Note that these methods are implemented at least in part as macros, so
the flags parameter (which must be defined as an unsigned long) is
seemingly passed by value. This parameter contains
architecture-specific data containing the state of the interrupt
systems. Because at least one supported architecture incorporates
stack information into the value (ahem, SPARC), flags cannot be passed
to another function (specifically, it must remain on the same stack
frame). For this reason, the call to save and the call to restore
interrupts must occur in the same function.
All the previous functions can be called from both interrupt and
process context.
Reading Why kernel code/thread executing in interrupt context cannot sleep? which links to Robert Loves article, I read this :
some interrupt handlers (known in
Linux as fast interrupt handlers) run
with all interrupts on the local
processor disabled. This is done to
ensure that the interrupt handler runs
without interruption, as quickly as
possible. More so, all interrupt
handlers run with their current
interrupt line disabled on all
processors. This ensures that two
interrupt handlers for the same
interrupt line do not run
concurrently. It also prevents device
driver writers from having to handle
recursive interrupts, which complicate
programming.
Below is part of the code in Linux kernel 4.15.18, which shows that spin_lock_irq() calls __raw_spin_lock_irq(). As you can see, it does not save any flags; it only disables interrupts.
static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
{
local_irq_disable();
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
The code below shows spin_lock_irqsave(), which saves the current state of the flags and then disables preemption.
static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
unsigned long flags;
local_irq_save(flags);
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
/*
* On lockdep we dont want the hand-coded irq-enable of
* do_raw_spin_lock_flags() code, because lockdep assumes
* that interrupts are not re-enabled during lock-acquire:
*/
#ifdef CONFIG_LOCKDEP
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
#else
do_raw_spin_lock_flags(lock, &flags);
#endif
return flags;
}
This question starts from the false assertion:
On an SMP machine we must use spin_lock_irqsave and not spin_lock_irq from interrupt context.
Neither of these should be used from interrupt
context, on SMP or on UP. That said, spin_lock_irqsave()
may be used from interrupt context, as being more universal
(it can be used in both interrupt and normal contexts), but
you are supposed to use spin_lock() from interrupt context,
and spin_lock_irq() or spin_lock_irqsave() from normal context.
The use of spin_lock_irq() is almost always the wrong thing
to do in interrupt context, be it SMP or UP. It may work
because most interrupt handlers run with IRQs locally enabled,
but you shouldn't rely on that.
UPDATE: as some people misread this answer, let me clarify that
it only explains what is for and what is not for an interrupt
context locking. There is no claim here that spin_lock() should
only be used in interrupt context. It can be used in a process
context too, for example if there is no need to lock in interrupt
context.

Device Driver IRQL and Thread/Context Switches

I'm new to Windows device driver programming. I know that certain operations can only be performed at IRQL PASSIVE_LEVEL. For example, Microsoft have this sample code of how to write to a file from a kernel driver:
if (KeGetCurrentIrql() != PASSIVE_LEVEL)
return STATUS_INVALID_DEVICE_STATE;
Status = ZwCreateFile(...);
My question is this: what is preventing the IRQL from being raised after the KeGetCurrentIrql() check above? Say a context or thread switch occurs; couldn't the IRQL suddenly be DISPATCH_LEVEL when control returns to my driver, which would then result in a system crash?
If this is NOT possible, then why not just check the IRQL in the DriverEntry function and be done with it once and for all?
The IRQL of a thread can only be raised by the thread itself.
Because you are called from upper/lower drivers, the IRQL of the current running context may be different, and there are a couple of functions that raise/lower the IRQL.
A couple of examples:
IRP_MJ_READ
NTSTATUS DispatchRead(
__in struct _DEVICE_OBJECT *DeviceObject,
__in struct _IRP *Irp
)
{
// this will be called at irql == PASSIVE_LEVEL
...
// we acquire a spin lock
KSPIN_LOCK lck;
KeInitializeSpinLock( &lck );
KIRQL prev_irql;
KeAcquireSpinLock( &lck,&prev_irql );
// KeGetCurrentIrql() == DISPATCH_LEVEL
KeReleaseSpinLock( &lck, prev_irql );
// KeGetCurrentIrql() == PASSIVE_LEVEL
...
}
(Io-)Completion routines may be called at DISPATCH_LEVEL and so should behave accordingly.
NTSTATUS CompleteSth(IN PDEVICE_OBJECT DeviceObject,IN PIRP Irp,IN PVOID Context)
{
// KeGetCurrentIrql() >= PASSIVE_LEVEL
}
The IRQL can only change in any meaningful way under your control by setting it. There are two "thread specific" IRQLs - PASSIVE_LEVEL and APC_LEVEL. You control going in and out of these levels with things like fast mutexes, and a context switch to your thread will always leave you at the level you were in before. Above that are "processor specific" IRQLs. That is DISPATCH_LEVEL or above. In these levels a context switch cannot occur. You get into these levels using spin locks and such. ISRs will occur at higher IRQLs on your thread, but you can't see them. When they return control to you your IRQL is restored.
DriverEntry is also called at PASSIVE_LEVEL.
If you want to have a job done at PASSIVE_LEVEL, use functions like IoQueueWorkItem.
