Although grabbing one processor core by a specific thread is not a good thing, I would like to make one kernel thread runs on a particular core without preemption and interrupts to get performance benefit for a while (same as we pin user process to one core)
After my kernel driver is loaded and when the kernel function is invoked through IOCTL, it disables local interrupts using local_irq_disable() function. What I expected was because interrupt for that specific core has been disabled, kernel thread should not be scheduled out from that core and it should run until local_irq_enble() function is invoked.
However, after a few seconds, kernel watchdog is invoked and machine is rebooted automatically with some watchdog messages. Although I pinned my thread to logical processor 3, watch dog prints that kernel cannot be rebooted because of another logical core is stuck, for example 4 not 3.
How the kernel thread can be scheduled out even though the local interrupt for that processor has been disabled? Is there any method to disable interrupts and preemption for that specific core for a while? And if the watchdog is problem which watchdog has to be turned off? watchdog? nmi-watchdog?
static long unlocked_ioctl(struct file *filp, unsigned int cmd, unsigned long argp)
{
int cpuNum = 0;
switch (cmd) {
case MON:
printk("Monitoring starts\n");
cpuNum = smp_processor_id();
printk("Monitor starts to run at core %d\n", cpuNum);
local_irq_disable();
while(1);
break;
default:
return -EINVAL;
break;
}
return 0;
}
Instead of putting some logic in the while loop, I intentionally make it run inside an infinite loop. Let's assume that the current core running this ioctl function in my driver runs at 4 at the beginning, but after few seconds, kernel starts to rebooted automatically with watchdog's message such as processor 5 not 4 doesn't responds.
Related
From the kernel I can call local_irq_disable(). To my understanding it will disable the interrupts of the current CPU. And interrupts will remain disabled until I call local_irq_enable(). Please correct me if my understanding is incorrect.
If my understanding is correct, does it mean upon calling local_irq_disable() interrupt is also disabled for a process in the user space that is running on that same CPU?
More details:
I have a process running in the user space which I want to run without affected by interrupts and context switch. As it is not possible from the user space, I thought disabling interrupt and kernel preemption for a particular CPU from kernel will help in this case. Therefore, I wrote a simple device driver to disable kernel preemption and local interrupt by using the following code,
int i = irqs_disabled();
pr_info("before interrupt disable: %d\n", i);
pr_info("module is loaded on processor: %d\n", smp_processor_id());
id = get_cpu();
message[1] = smp_processor_id() + '0';
local_irq_disable();
printk(KERN_INFO " Current CPU id is %c\n", message[1]);
printk(KERN_INFO " local_irq_disable() called, Disable local interrupts\n");
pr_info("After interrupt disable: %d\n", irqs_disabled());
output: $dmesg
[22690.997561] before interrupt disable: 0
[22690.997564] Current CPU id is 1
[22690.997565] local_irq_disable() called, Disable local interrupts
[22690.997566] After interrupt disable: 1
I think the output confirms that local_irq_disable() does disable local interrupts.
After I disable the kernel preemption and interrupts, In the userspace I use CPU_SET() to pin my process into that particular CPU. But after doing all these I'm still not getting the desired outcome. So, it seems like disabling interrupt of a particular CPU from kernel also disable interrupts for a user space process running on that CPU is not true. I'm confused.
I was looking for an answer to the above question but could not get any suitable answer.
Duration of the CPU state with disabled interrupts should be short, because it affects the whole OS. For that reason allowing user space code to be run with disabled interrupts is considered as bad practice and is not supported by the Linux kernel.
It is responsibility of the kernel module to wrap by local_irq_disable / local_irq_enable only the kernel code. Sometimes the kernel itself could "fix" incorrect usage of these functions, but that fact shouldn't be relied upon when write a module.
I have a process running in the userspace which I want to run without affected by interrupts and context switch.
Protection from the context switch could be achieved by proper setting of scheduling policy, affinity and priority of the process. That way, the scheduler will never attempt to reschedule your process. There are several questions on Stack Overflow about making a CPU to be exclusive for a selected process.
As for interrupts, they shouldn't be disabled for a user code.
If user code accesses some hardware which should have interrupts disabled, then consider moving your code into the kernel space.
If even rare interrupts badly affect on the performance of your process or its timing, then try to reconfigure Linux kernel to be "more real time". There are also some boot-time configuration options, which could help in further reducing number of interrupts on a specific core(s). See e.g. that question: Why does using taskset to run a multi-threaded Linux program on a set of isolated cores cause all threads to run on one core?.
Note, that Linux kernel is not a base for real-time OS and never intended to be. So, if no configuration and boot settings could help you, consider to choose for your application another OS, which is real time.
I'm trying to disable all interrupts including NMI's on a single core in a processor and put that core into an infinite loop with a JMP instruction targeting itself (bytecode 0xEBFE) I tried this with the following machine code:
cli
in al, 0x70
mov bl, 0x80
or al, bl
out 0x70, al
jmp self (0xEBFE)
I assumed that disabling NMI interrupts would also disable the watchdog since according to this link the watchdog timer is an NMI interrupt, but what happened when I ran this code is after around 5 seconds my computer bugchecked with code 0x101 CLOCK_WATCHDOG_TIMEOUT. I'm wondering if windows notices that I've disabled NMI interrupts and then re-enables them before initiating the kernel panic. Does anyone know how to disable the watchdog timer in windows 7?
I don't think the NMIs are at fault here.
External NMIs are obsolete, they are hard to route in an SMP system. That watchdog timer is also obsolete, it was either a secondary PIT or a limited fourth channel of the primary PIT:
----------P00440047--------------------------
PORT 0044-0047 - Microchannel - PROGRAMMABLE INTERVAL TIMER 2
SeeAlso: PORT 0040h,PORT 0048h
0044 RW PIT counter 3 (PS/2)
used as fail-safe timer. generates an NMI on time out.
for user generated NMI see at 0462.
0047 -W PIT control word register counter 3 (PS/2, EISA)
bit 7-6 = 00 counter 3 select
= 01 reserved
= 10 reserved
= 11 reserved
bit 5-4 = 00 counter latch command counter 3
= 01 read/write counter bits 0-7 only
= 1x reserved
bit 3-0 = 00
----------P0048004B--------------------------
PORT 0048-004B - EISA - PROGRAMMABLE INTERVAL TIMER 2
Note: this second timer is also supported by many Intel chipsets
SeeAlso: PORT 0040h,PORT 0044h
0048 RW EISA PIT2 counter 3 (Watchdog Timer)
0049 ?? EISA 8254 timer 2, not used (counter 4)
004A RW EISA PIT2 counter 5 (CPU speed control)
004B -W EISA PIT2 control word
These hardware is gone, it's not present on modern systems. I've tested my machine and I don't have it.
Intel chipsets don't have it:
There is only the primary PIT.
Modern timers are the LAPIC timer and the HPET (Linux did even resort to using the PMC registers).
Windows does support an HW WDT, in fact Microsoft went as long as defining an ACPI extension: the WDAT table.
This WDT however can only reboot or shutdown the system, in hardware, without the intervention of the software.
// Configures the watchdog hardware to perform a reboot
// when it is fired.
//
#define WATCHDOG_ACTION_SET_REBOOT 0x11
//
// Determines if the watchdog hardware is configured to perform
// a system shutdown when fired.
//
#define WATCHDOG_ACTION_QUERY_SHUTDOWN 0x12
//
// Configures the watchdog hardware to perform a system shutdown
// when fired.
//
#define WATCHDOG_ACTION_SET_SHUTDOWN 0x13
Microsoft set quite a quit of requirement for this WDT since it must be setup as early as possible in the boot process, before the PnP enumeration (i.e. PCI(e) enumeration).
This is not the timer that bugchecked your system.
By the way, I don't have this timer (my system is missing the WDAT table) and I don't expect it to be found on client hardware.
The bugcheck 0x101 is due to a software WDT, it is raised inside a function in ntoskrnl.exe.
This function is called by KeUpdateRunTime and by another chain of calls starting in DriverEntry:
According to Windows Internals, KeUpdateRunTime is used to update the internal ticks counting of Windows.
I'd expect only a single logical processor to be put in charge of that, though I'm not sure of how exactly Windows housekeeps time.
I'd also expect this software WDT to be implemented in a master-slave fashion: each CPU increments its own counter and a designed CPU check the counters periodically (or any equivalent implementation).
This seems to be suggested by the wording of the documentation of the 0x101 bugcheck:
The CLOCK_WATCHDOG_TIMEOUT bug check has a value of 0x00000101. This indicates that an expected clock interrupt on a secondary processor, in a multi-processor system, was not received within the allocated interval.
Again, I'm not an expert on this part of Windows (The user MdRm, probably is) and this may be utterly wrong, but if it isn't you probably are better of following Alex's advice and boot with one less logical CPU.
You can then execute code on that CPU with an INIT-SIPI-SIPI sequence as described on the Intel's manual but you must be careful because the issuing processor is using paging while the sleeping one is not yet (the processor will start up in real-mode).
Initialising a CPU may be a little cumbersome but not too much after all.
Stealing it may result in other problems besides the WDT, for example if Windows has routed an interrupt to that processor only.
I don't know if there is driver API to unregister a logical processor, I found nothing looking at the exports of hal.dll and ntoskrnl.exe.
I read this article http://www.linuxjournal.com/article/5833 to learn about spinlock. I try this to use it in my kernel driver.
Here is what my driver code needs to do:
In f1(), it will get the spin lock, and caller can call f2() will wait for the lock since the spin lock is not being unlock. The spin lock will be unlock in my interrupt handler (triggered by the HW).
void f1() {
spin_lock(&mylock);
// write hardware
REG_ADDR += FLAG_A;
}
void f2() {
spin_lock(&mylock);
//...
}
The hardware will send the application an interrupt and my interrupt handler will call spin_unlock(&mylock);
My question is if I call
f1()
f2() // i want this to block until the interrupt return saying setting REG_ADDR is done.
when I run this, I get an exception in kernel saying a deadlock " INFO: possible recursive locking detected"
How can I re-write my code so that kernel does not think I have a deadlock?
I want my driver code to wait until HW sends me an interrupt saying setting REG_ADDR is done.
Thank you.
First, since you'll be expecting to block while waiting for the interrupt, you shouldn't be using spinlocks to lock the hardware as you'll probably be holding the lock for a long time. Using a spinlock in this case will waste a lot of CPU cycles if that function is called frequently.
I would first use a mutex to lock access to the hardware register in question so other kernel threads can't simultaneously modify the register. A mutex is allowed to sleep so if it can't acquire the lock, the thread is able to go to sleep until it can.
Then, I'd use a wait queue to block the thread until the interrupt arrives and signals that the bit has finished setting.
Also, as an aside, I noticed you're trying to access your peripheral by using the following expression REG_ADDR += FLAG_A;. In the kernel, that's not the correct way to do it. It may seem to work but will break on some architectures. You should be using the read{b,w,l} and write{b,w,l} macros like
unsigned long reg;
reg = readl(REG_ADDR);
reg |= FLAG_A;
writel(reg, REG_ADDR);
where REG_ADDR is an address you obtained from ioremap.
I will agree with Michael that Spinlock, Semaphores, Mutex ( Or any other Locking Mechanisms) must be used when any of the resources(Memory/variable/piece of code) has the probability of getting shared among the kernel/user threads.
Instead of using any of the Locking primitives available I would suggest using other sleeping functionalities available in kernel like wait_event_interruptibleand wake_up. They are simple and easy to exploit them into your code. You can find its details and exploitation on net.
In ubuntu10.04 linux kernel if I insmod a module which runs
while(1);
in init_module part, entire system stops.
However, if I load a sys file in Windows 7
which runs while(1); in DriverEntry part,
system gets slow but still works.
can someone explain me why two system differs
and what is happening inside kernel?...
I think in first case(infinite loop in init_module),
there is no reason the system stops. because
even if I make while(1); in init_module, it is running
in context of insmod user application program.
so the flow infinite loop has to be scheduled by hardware interrupt signal.
This is just my opinion, I want to know the details if I am wrong...
init_module() is a system call, it runs in kernel space and not in user space.
From what you have observed, it looks like the NT kernel performs module initialization in parallel, whereas the Linux kernel does it sequentially. It might have to do with their respective architectures, NT being a hybrid kernel and Linux being monolithic.
Adding to Frédéric's answer: on Windows the DriverEntry function runs at IRQL PASSIVE_LEVEL (same as virtually all user mode code, all if we exclude APCs). Which means that it can be interrupted by any code running at a higher IRQL at any point. So what you probably encounter here is that the thread that goes into the infinite loop is still being scheduled (thus consuming CPU time), but due to its (low) IRQL it isn't able to starve the system threads or much of the other code that is running. It will, however, be able to starve user mode threads. The effect can be anything from a slowdown to a perceived hanging system.
On an SMP machine we must use spin_lock_irqsave and not spin_lock_irq from interrupt context.
Why would we want to save the flags (which contain the IF)?
Is there another interrupt routine that could interrupt us?
spin_lock_irqsave is basically used to save the interrupt state before taking the spin lock, this is because spin lock disables the interrupt, when the lock is taken in interrupt context, and re-enables it when while unlocking. The interrupt state is saved so that it should reinstate the interrupts again.
Example:
Lets say interrupt x was disabled before spin lock was acquired
spin_lock_irq will disable the interrupt x and take the the lock
spin_unlock_irq will enable the interrupt x.
So in the 3rd step above after releasing the lock we will have interrupt x enabled which was earlier disabled before the lock was acquired.
So only when you are sure that interrupts are not disabled only then you should spin_lock_irq otherwise you should always use spin_lock_irqsave.
If interrupts are already disabled before your code starts locking, when you call spin_unlock_irq you will forcibly re-enable interrupts in a potentially unwanted manner. If instead you also save the current interrupt enable state in flags through spin_lock_irqsave, attempting to re-enable interrupts with the same flags after releasing the lock, the function will just restore the previous state (thus not necessarily enabling interrupts).
Example with spin_lock_irqsave:
spinlock_t mLock = SPIN_LOCK_UNLOCK;
unsigned long flags;
spin_lock_irqsave(&mLock, flags); // Save the state of interrupt enable in flags and then disable interrupts
// Critical section
spin_unlock_irqrestore(&mLock, flags); // Return to the previous state saved in flags
Example with spin_lock_irq( without irqsave ):
spinlock_t mLock = SPIN_LOCK_UNLOCK;
unsigned long flags;
spin_lock_irq(&mLock); // Does not know if interrupts are already disabled
// Critical section
spin_unlock_irq(&mLock); // Could result in an unwanted interrupt re-enable...
The need for spin_lock_irqsave besides spin_lock_irq is quite similar to the reason local_irq_save(flags) is needed besides local_irq_disable. Here is a good explanation of this requirement taken from Linux Kernel Development Second Edition by Robert Love.
The local_irq_disable() routine is dangerous if interrupts were
already disabled prior to its invocation. The corresponding call to
local_irq_enable() unconditionally enables interrupts, despite the
fact that they were off to begin with. Instead, a mechanism is needed
to restore interrupts to a previous state. This is a common concern
because a given code path in the kernel can be reached both with and
without interrupts enabled, depending on the call chain. For example,
imagine the previous code snippet is part of a larger function.
Imagine that this function is called by two other functions, one which
disables interrupts and one which does not. Because it is becoming
harder as the kernel grows in size and complexity to know all the code
paths leading up to a function, it is much safer to save the state of
the interrupt system before disabling it. Then, when you are ready to
reenable interrupts, you simply restore them to their original state:
unsigned long flags;
local_irq_save(flags); /* interrupts are now disabled */ /* ... */
local_irq_restore(flags); /* interrupts are restored to their previous
state */
Note that these methods are implemented at least in part as macros, so
the flags parameter (which must be defined as an unsigned long) is
seemingly passed by value. This parameter contains
architecture-specific data containing the state of the interrupt
systems. Because at least one supported architecture incorporates
stack information into the value (ahem, SPARC), flags cannot be passed
to another function (specifically, it must remain on the same stack
frame). For this reason, the call to save and the call to restore
interrupts must occur in the same function.
All the previous functions can be called from both interrupt and
process context.
Reading Why kernel code/thread executing in interrupt context cannot sleep? which links to Robert Loves article, I read this :
some interrupt handlers (known in
Linux as fast interrupt handlers) run
with all interrupts on the local
processor disabled. This is done to
ensure that the interrupt handler runs
without interruption, as quickly as
possible. More so, all interrupt
handlers run with their current
interrupt line disabled on all
processors. This ensures that two
interrupt handlers for the same
interrupt line do not run
concurrently. It also prevents device
driver writers from having to handle
recursive interrupts, which complicate
programming.
Below is part of code in linux kernel 4.15.18, which shows that spiin_lock_irq() will call __raw_spin_lock_irq(). However, it will not save any flags as you can see below part of the code but disable the interrupt.
static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
{
local_irq_disable();
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
Below code shows spin_lock_irqsave() which saves the current stage of flag and then preempt disable.
static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
unsigned long flags;
local_irq_save(flags);
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
/*
* On lockdep we dont want the hand-coded irq-enable of
* do_raw_spin_lock_flags() code, because lockdep assumes
* that interrupts are not re-enabled during lock-acquire:
*/
#ifdef CONFIG_LOCKDEP
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
#else
do_raw_spin_lock_flags(lock, &flags);
#endif
return flags;
}
This question starts from the false assertion:
On an SMP machine we must use spin_lock_irqsave and not spin_lock_irq from interrupt context.
Neither of these should be used from interrupt
context, on SMP or on UP. That said, spin_lock_irqsave()
may be used from interrupt context, as being more universal
(it can be used in both interrupt and normal contexts), but
you are supposed to use spin_lock() from interrupt context,
and spin_lock_irq() or spin_lock_irqsave() from normal context.
The use of spin_lock_irq() is almost always the wrong thing
to do in interrupt context, being this SMP or UP. It may work
because most interrupt handlers run with IRQs locally enabled,
but you shouldn't try that.
UPDATE: as some people misread this answer, let me clarify that
it only explains what is for and what is not for an interrupt
context locking. There is no claim here that spin_lock() should
only be used in interrupt context. It can be used in a process
context too, for example if there is no need to lock in interrupt
context.