I'm working on the Linux kernel 3.10 patched with LITMUS^RT, a real-time extension with a focus on multiprocessor real-time scheduling and synchronization.
My aim is to write a scheduler that allows a task to migrate from one CPU to another when it is preempted, and only when particular conditions are met. My current implementation is affected by a deadlock between CPUs, as shown by the following error:
Setting up rt task parameters for process 1622.
[ INFO: inconsistent lock state ]
3.10.5-litmus2013.1 #105 Not tainted
---------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
rtspin/1620 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&rq->lock){?.-.-.}, at: [<ffffffff8155f0d5>] __schedule+0x175/0xa70
{IN-HARDIRQ-W} state was registered at:
[<ffffffff8107832a>] __lock_acquire+0x86a/0x1e90
[<ffffffff81079f65>] lock_acquire+0x95/0x140
[<ffffffff81560fc6>] _raw_spin_lock+0x36/0x50
[<ffffffff8105e231>] scheduler_tick+0x61/0x210
[<ffffffff8103f112>] update_process_times+0x62/0x80
[<ffffffff81071677>] tick_periodic+0x27/0x70
[<ffffffff8107174b>] tick_handle_periodic+0x1b/0x70
[<ffffffff810042d0>] timer_interrupt+0x10/0x20
[<ffffffff810849fd>] handle_irq_event_percpu+0x6d/0x260
[<ffffffff81084c33>] handle_irq_event+0x43/0x70
[<ffffffff8108778c>] handle_level_irq+0x6c/0xc0
[<ffffffff81003a89>] handle_irq+0x19/0x30
[<ffffffff81003925>] do_IRQ+0x55/0xd0
[<ffffffff81561cef>] ret_from_intr+0x0/0x13
[<ffffffff8108615a>] __setup_irq+0x20a/0x4e0
[<ffffffff81086473>] setup_irq+0x43/0x90
[<ffffffff8184fb5f>] setup_default_timer_irq+0x12/0x14
[<ffffffff8184fb78>] hpet_time_init+0x17/0x19
[<ffffffff8184fb46>] x86_late_time_init+0xa/0x11
[<ffffffff8184ecd1>] start_kernel+0x270/0x2e0
[<ffffffff8184e5a3>] x86_64_start_reservations+0x2a/0x2c
[<ffffffff8184e66c>] x86_64_start_kernel+0xc7/0xca
irq event stamp: 8886
hardirqs last enabled at (8885): [<ffffffff8108dd6b>] rcu_note_context_switch+0x8b/0x2d0
hardirqs last disabled at (8886): [<ffffffff81561052>] _raw_spin_lock_irq+0x12/0x50
softirqs last enabled at (8880): [<ffffffff81037125>] __do_softirq+0x195/0x2b0
softirqs last disabled at (8857): [<ffffffff8103738d>] irq_exit+0x7d/0x90
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&rq->lock);
<Interrupt>
lock(&rq->lock);
*** DEADLOCK ***
1 lock held by rtspin/1620:
#0: (&rq->lock){?.-.-.}, at: [<ffffffff8155f0d5>] __schedule+0x175/0xa70
stack backtrace:
CPU: 1 PID: 1620 Comm: rtspin Not tainted 3.10.5-litmus2013.1 #105
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
ffffffff81bc4cc0 ffff88001cdf3aa8 ffffffff8155ae1e ffff88001cdf3af8
ffffffff81557f39 0000000000000000 ffff880000000001 ffff880000000001
0000000000000000 ffff88001c5ec280 ffffffff810750d0 0000000000000002
Call Trace:
[<ffffffff8155ae1e>] dump_stack+0x19/0x1b
[<ffffffff81557f39>] print_usage_bug+0x1f7/0x208
[<ffffffff810750d0>] ? print_shortest_lock_dependencies+0x1c0/0x1c0
[<ffffffff81075ead>] mark_lock+0x2ad/0x320
[<ffffffff81075fd0>] mark_held_locks+0xb0/0x120
[<ffffffff8129bf71>] ? pfp_schedule+0x691/0xba0
[<ffffffff810760f2>] trace_hardirqs_on_caller+0xb2/0x210
[<ffffffff8107625d>] trace_hardirqs_on+0xd/0x10
[<ffffffff8129bf71>] pfp_schedule+0x691/0xba0
[<ffffffff81069e70>] pick_next_task_litmus+0x40/0x500
[<ffffffff8155f17a>] __schedule+0x21a/0xa70
[<ffffffff8155f9f4>] schedule+0x24/0x70
[<ffffffff8155d1bc>] schedule_timeout+0x14c/0x200
[<ffffffff8105e3ed>] ? get_parent_ip+0xd/0x50
[<ffffffff8105e589>] ? sub_preempt_count+0x69/0xf0
[<ffffffff8155ffab>] wait_for_completion_interruptible+0xcb/0x140
[<ffffffff81060e60>] ? try_to_wake_up+0x470/0x470
[<ffffffff8129266f>] do_wait_for_ts_release+0xef/0x190
[<ffffffff81292782>] sys_wait_for_ts_release+0x22/0x30
[<ffffffff81562552>] system_call_fastpath+0x16/0x1b
At this point I see two possible approaches to solving this problem:
Release the lock held by the current CPU before migrating to the target CPU, using the kernel's own functions. LITMUS^RT provides a callback in which I can decide which task will execute next:
static struct task_struct* pfp_schedule(struct task_struct *prev)
{
        [...]
        if (is_preempted(prev)) {
                // release lock on current cpu
                migrate_to_another_cpu();
        }
        [...]
}
What I think I must do is release the current lock before calling migrate_to_another_cpu(), but I still haven't found a way of doing that.
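Judging from how the scheduler core drops and retakes the run-queue lock in similar situations, I imagine the pattern would look roughly like the sketch below. This is a sketch only: migrate_to_another_cpu() is my own placeholder, not a kernel or LITMUS^RT function, and I am not sure dropping rq->lock is actually safe at this point in the plugin.

        /* sketch: rq->lock is a raw spinlock in this kernel */
        raw_spin_unlock(&rq->lock);     /* drop the lock held by this CPU */
        migrate_to_another_cpu();       /* hypothetical migration step    */
        raw_spin_lock(&rq->lock);       /* retake it before returning     */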
The scheduler I want to implement allows only one task migration at a time, so a deadlock should be theoretically impossible. For some reason, though, the execution of my task set fails and I get the error listed above, which I believe is identified by some kind of preliminary analysis performed by the Linux kernel (presumably lockdep). This, however, is a potential deadlock, essentially a warning, and I would like to let my task set continue its normal execution and ignore it.
Long story short, does anyone know if one of these solutions is possible and, if yes, how to implement it?
Thanks in advance!
P.S.: tips or better ideas are very welcome! ;-)
From the kernel I can call local_irq_disable(). To my understanding, it will disable interrupts on the current CPU, and interrupts will remain disabled until I call local_irq_enable(). Please correct me if my understanding is incorrect.
If my understanding is correct, does calling local_irq_disable() also disable interrupts for a user-space process running on that same CPU?
More details:
I have a process running in user space which I want to run without being affected by interrupts and context switches. As that is not possible from user space, I thought disabling interrupts and kernel preemption for a particular CPU from the kernel would help in this case. Therefore, I wrote a simple device driver that disables kernel preemption and local interrupts using the following code:
int id;                  /* declarations shown for completeness */
char message[8];
int i = irqs_disabled();

pr_info("before interrupt disable: %d\n", i);
pr_info("module is loaded on processor: %d\n", smp_processor_id());
id = get_cpu();                        /* also disables kernel preemption */
message[1] = smp_processor_id() + '0';
local_irq_disable();                   /* disable interrupts on this CPU */
printk(KERN_INFO " Current CPU id is %c\n", message[1]);
printk(KERN_INFO " local_irq_disable() called, Disable local interrupts\n");
pr_info("After interrupt disable: %d\n", irqs_disabled());
Output of dmesg:
[22690.997561] before interrupt disable: 0
[22690.997564] Current CPU id is 1
[22690.997565] local_irq_disable() called, Disable local interrupts
[22690.997566] After interrupt disable: 1
I think the output confirms that local_irq_disable() does disable local interrupts.
After disabling kernel preemption and interrupts, in user space I use CPU_SET() to pin my process to that particular CPU. But after doing all this I'm still not getting the desired outcome. So it seems that disabling interrupts on a particular CPU from the kernel does not, in fact, shield a user-space process running on that CPU from interrupts. I'm confused.
I was looking for an answer to the above question but could not find a suitable one.
The duration of a CPU state with disabled interrupts should be short, because it affects the whole OS. For that reason, allowing user-space code to run with disabled interrupts is considered bad practice and is not supported by the Linux kernel.
It is the responsibility of the kernel module to wrap only kernel code in local_irq_disable() / local_irq_enable(). Sometimes the kernel itself can "fix" incorrect usage of these functions, but that fact shouldn't be relied upon when writing a module.
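For illustration, the usual correct pattern inside a module is the following minimal sketch; local_irq_save()/local_irq_restore() are the variants that also preserve the previous interrupt state, which is generally safer than the bare disable/enable pair:

unsigned long flags;

local_irq_save(flags);     /* IRQs off on this CPU, previous state saved */
/* ... short, kernel-only critical section ... */
local_irq_restore(flags);  /* restore the previous interrupt state */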
I have a process running in user space which I want to run without being affected by interrupts and context switches.
Protection from context switches can be achieved by properly setting the scheduling policy, affinity, and priority of the process. That way, the scheduler will never attempt to reschedule your process. There are several questions on Stack Overflow about making a CPU exclusive to a selected process.
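As a rough illustration (a sketch, not a complete recipe; error handling is trimmed and the CPU number and priority are example values), pinning a process to one core and giving it a real-time FIFO priority from user space looks like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 99 };

        CPU_ZERO(&set);
        CPU_SET(1, &set);                       /* run only on CPU 1 */
        if (sched_setaffinity(0, sizeof(set), &set))
                perror("sched_setaffinity");
        if (sched_setscheduler(0, SCHED_FIFO, &sp))
                perror("sched_setscheduler");   /* needs root/CAP_SYS_NICE */

        /* ... time-sensitive work ... */
        return 0;
}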
As for interrupts, they shouldn't be disabled for user code.
If your user code accesses some hardware which requires interrupts to be disabled, then consider moving that code into kernel space.
If even rare interrupts badly affect the performance or timing of your process, then try to reconfigure the Linux kernel to be "more real-time". There are also some boot-time configuration options which can help further reduce the number of interrupts on specific core(s). See e.g. this question: Why does using taskset to run a multi-threaded Linux program on a set of isolated cores cause all threads to run on one core?
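For example, a command line that keeps cores 2 and 3 mostly free of scheduler and interrupt activity might include the following (illustrative values; these are standard mainline boot parameters, though nohz_full requires a kernel built with CONFIG_NO_HZ_FULL):

isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3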
Note that the Linux kernel is not a base for a real-time OS and was never intended to be. So, if no configuration or boot settings can help you, consider choosing another OS for your application, one that is real-time.
Here is a scenario. Let's say that a kernel task is running on a uniprocessor system with preemption disabled. The task acquires a spin lock. Now it is executing its critical section. At this time, what if the time slice available for this task expires and it has to be scheduled out?
1. Does the spin_lock have a mechanism to prevent this?
2. Can it be scheduled out? If yes, then what happens to the critical section?
3. Can it be interrupted by an IRQ? (Assuming that preemption is disabled)
4. Is this scenario feasible? In other words, could this scenario happen?
From the kernel code, I understand that spin_lock is basically a no-op on a uniprocessor with preemption disabled. To be accurate, all it does is barrier().
I understand why it is a no-op (as it is a uniprocessor, no other task could be manipulating the data at that instant), but I still don't understand how the critical section could go uninterrupted (due to IRQs or scheduling).
What am I missing here? Pointers to the Linux kernel code which indicates about this could be really helpful.
My basic assumptions:
32 bit Linux kernel
Actually, spin_lock() disables preemption by calling preempt_disable() before it tries to acquire the lock, so scenarios #1, #2, and #3 could never happen.
In recent source code, spin_lock() eventually calls __raw_spin_lock(), which calls preempt_disable() before calling spin_acquire() to acquire the lock. spin_lock_irqsave(), which is commonly used in interrupt context, follows a similar path.
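For reference, here is the relevant helper, paraphrased from include/linux/spinlock_api_smp.h in mainline; note that preemption is disabled before any attempt to take the lock:

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
        preempt_disable();
        spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
        LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}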
Regarding #3: if the variable is shared between process and interrupt context, you should always use spin_lock_irq()/spin_lock_irqsave() instead of spin_lock() to avoid a deadlock scenario.
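A minimal sketch of that rule (names here are illustrative): when data is touched both from process context and from an interrupt handler, the process-context side must also block local interrupts while holding the lock:

static DEFINE_SPINLOCK(my_lock);   /* illustrative */
static int shared_counter;

/* process context */
void update_from_process(void)
{
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);  /* blocks local IRQs too */
        shared_counter++;
        spin_unlock_irqrestore(&my_lock, flags);
}

/* interrupt handler: local IRQs are already off, plain spin_lock is fine */
void update_from_irq(void)
{
        spin_lock(&my_lock);
        shared_counter--;
        spin_unlock(&my_lock);
}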
The mechanism that handles expiring time slices is the timer interrupt. The interrupt will set the TIF_NEED_RESCHED flag for the process. When returning from the timer's interrupt context back to your critical section, a check is made whether or not to preempt the process due to the TIF_NEED_RESCHED flag. Since preemption is disabled, nothing happens and execution returns to your critical section.
When your critical section is over, releasing the lock calls preempt_enable() to re-enable preemption. At that moment another check is made as to whether or not to preempt. Since the TIF_NEED_RESCHED flag is set and preemption is now enabled, the process is preempted.
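That recheck is visible in the macro itself, paraphrased here from include/linux/preempt.h in a 3.x kernel (details vary slightly between versions):

#define preempt_enable() \
do { \
        preempt_enable_no_resched(); \
        barrier(); \
        preempt_check_resched(); /* reschedules if TIF_NEED_RESCHED is set */ \
} while (0)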
1. Spin locks disable preemption.
2. No, because preemption is disabled.
3. Yes. There are spin lock versions that disable IRQs to prevent this.
4. No, because spin locks disable preemption.
Spinlocks don't really exist on uniprocessor systems anyway, because they don't make sense there. If a thread that doesn't own the lock attempts to acquire it, that means the thread that does own it is currently not running (there is only one CPU). So there's no reason to spin-wait for something that isn't running. For this reason, spinlocks in these configurations are optimized down to just a preemption disable, so that no other thread can touch the critical section.
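You can see this in the uniprocessor lock implementation, paraphrased from include/linux/spinlock_api_up.h: the "lock" reduces to a preemption disable plus an annotation for static analysis, with no spinning at all:

#define ___LOCK(lock) \
        do { __acquire(lock); (void)(lock); } while (0)

#define __LOCK(lock) \
        do { preempt_disable(); ___LOCK(lock); } while (0)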
In the Linux kernel, when a breakpoint I register with register_wide_hw_breakpoint is triggered, the callback handler runs endlessly until the breakpoint is unregistered.
Background: To test a driver for some hardware we are making, I am writing a second kernel module that emulates the hardware interface. My intent is to set a watchpoint on a memory location that, in the real hardware, would be a control register, so that writing to this 'register' can trigger an operation by the emulator driver.
See here for a complete sample.
I set the breakpoint as follows:
struct perf_event_attr attr;              /* declarations shown for completeness */
struct perf_event * __percpu *test_hbp;

hw_breakpoint_init(&attr);
attr.bp_addr = kallsyms_lookup_name("test_value");
attr.bp_len  = HW_BREAKPOINT_LEN_4;
attr.bp_type = HW_BREAKPOINT_W;           /* fire on writes */

test_hbp = register_wide_hw_breakpoint(&attr, test_hbp_handler, NULL);
but when test_value is written to, the callback (test_hbp_handler) is triggered continually without control ever returning to the code that was about to write to test_value.
1) What should I be doing differently for this to work as expected (return execution to code that triggered breakpoint)?
2) How do I capture the value that was being written to the memory location?
In case this matters:
$ uname -a
Linux socfpga-cyclone5 3.10.37-ltsi-rt37-05714-ge4ee387 #1 SMP PREEMPT RT Mon Jan 5 17:51:35 UTC 2015 armv7l GNU/Linux
This is by design. When an ARM hardware watchpoint is hit, it generates a Data Abort exception. On ARM, Data Abort exceptions trigger before the instruction that triggers them finishes1. This means that, in the exception handler, registers and memory locations affected by the instruction still hold their old values (or, in some cases, undefined values). As such, when the handler finishes, it must retry the aborted instruction so that the interrupted program runs as intended2. If the watchpoint is still set when the handler returns, the instruction will trigger it again. This causes the loop you're seeing.
To get around this, userspace debuggers like GDB single-step over any instruction that hits a watchpoint with that watchpoint disabled before resuming execution. The underlying kernel API, however, just exposes the hardware watchpoint behavior directly. Since you're using the kernel API, it's up to your event handler to ensure that the watchpoint doesn't trigger on the retried instruction.
[The ARM watchpoint code in the kernel actually does support automatic single-step, but only under very specific conditions. Namely, it requires 1) that no event handler is registered to the watchpoint, 2) that the watchpoint is in userspace, and 3) that the watchpoint is not associated with a particular CPU. Since your use case violates at least (1) and (2), you have to find another solution.]3
Unfortunately, on ARM, there's no foolproof way to keep the watchpoint enabled without causing a loop. The breakpoint mode that GDB uses to single-step programs, "instruction mismatch," generates UNPREDICTABLE behaviour when used in kernel mode4. The best you can do is disable the watchpoint in your handler and then set a standard breakpoint to re-enable it on an instruction that you know will execute soon after.
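A skeleton of such a handler, using the kernel's perf_overflow_handler_t signature (a sketch only; the disarm/re-arm strategy is the part you have to design yourself):

static void test_hbp_handler(struct perf_event *bp,
                             struct perf_sample_data *data,
                             struct pt_regs *regs)
{
        /* The aborted store has NOT completed yet, so the watched
         * location still holds its old contents at this point. */
        pr_info("watchpoint hit at pc=%08lx\n", instruction_pointer(regs));

        /* To keep the retried instruction from firing again, the
         * watchpoint must not be armed when this handler returns;
         * e.g. disarm it here and re-arm it later from another
         * context (such as a workqueue), once the write is done. */
}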
For your MMIO emulation driver, watchpoints are probably not the answer. In addition to the issues just mentioned, most ARM cores have very few watchpoint registers, so the solution would not scale. I'm afraid I'm not familiar enough with ARM's memory model to suggest an alternative approach. However, Linux's existing code for emulating memory-mapped IO for virtual machines might be a good place to start.
1There are two types of Data Abort exceptions, synchronous and asynchronous, and it's left to the implementation to decide which one a watchpoint generates. I'm describing the behavior of synchronous exceptions in this answer, because that's what would cause the problem you're having.
2ARMv7-A/R Architecture Reference Manual, B1.9.8, "Data Abort exception."
3Linux Kernel v4.6, arch/arm/kernel/hw_breakpoint.c, lines 634-661.
4ARMv7-A/R Architecture Reference Manual, C3.3.3, "UNPREDICTABLE cases when Monitor debug-mode is selected."
Problem
I need a kernel thread that is able to work for prolonged periods of time without yielding, basically fully dedicating a CPU core to it on demand:
int my_kthread(void *arg)
{
        while (!kthread_should_stop()) {
                do_some_work();
                if (sleeping_enabled)
                        msleep(1000);
                else {
                        // What to do here to avoid lockup warnings
                        // and ensure system stability?
                }
        }
        return 0;
}
Background
The thread is created like this when the module that I am working on is loaded:
my_task = kthread_run(&my_kthread, (void *)some_data, "My KThread");
set_cpus_allowed(my_task, *cpumask_of(10)); // Pin thread to core #10
and stopped like this when the module is unloaded:
kthread_stop(my_task);
Everything works just fine when sleeping_enabled is true.
Otherwise, soon after the thread is started, the kernel complains of an apparent lockup.
At first, I merely aimed to avoid the various warnings such as
BUG: soft lockup - CPU#10 stuck for 22s!
and
INFO: rcu_sched detected stalls on CPUs/tasks: { 10} (detected by 15, t=30567 jiffies)
since they tend to flood my console with dumps for all >20 cores, and the "lockup" is desired behavior.
I tried poking the watchdog like this:
if(sleeping_enabled) msleep(1000);
else touch_softlockup_watchdog();
in combination with (echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress)
and pretty much got what I wanted (a never-sleeping thread that successfully does its work, with no spam in the console).
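Concretely, the non-sleeping branch then looks like this (just my workaround restated in code; the same RCU knob can also be set at boot via rcupdate.rcu_cpu_stall_suppress=1):

if (sleeping_enabled)
        msleep(1000);
else
        /* reset this CPU's soft-lockup detector every iteration;
         * the RCU stall warnings are silenced separately (see above) */
        touch_softlockup_watchdog();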
However, not only does this "solution" feel like cheating, it also seems I am completely breaking something by hogging that one core: when unloading the module via rmmod, the whole system freezes. The console starts periodically dumping soft lockups on all cores, with this call trace:
[<ffffffff810c96b0>] ? queue_stop_cpus_work+0xd0/0xd0
[<ffffffff810c9903>] cpu_stopper_thread+0xe3/0x1b0
[<ffffffff8108639a>] ? finish_task_switch+0x4a/0xf0
[<ffffffff8169e654>] ? __schedule+0x3c4/0x700
[<ffffffff81080e98>] ? __wake_up_common+0x58/0x90
[<ffffffff810c9820>] ? __stop_cpus+0x80/0x80
[<ffffffff81077e93>] kthread+0x93/0xa0
[<ffffffff816a9724>] kernel_thread_helper+0x4/0x10
[<ffffffff81077e00>] ? flush_kthread_worker+0xb0/0xb0
[<ffffffff816a9720>] ? gs_change+0x13/0x13
Meanwhile, my kernel thread continues running (as evidenced by some console messages that it prints every now and then), so it never sees kthread_should_stop() return true.
Unloading worked correctly and stopped the thread before I switched to not sleeping at all. Now I am unable to make iterative modifications without rebooting.
Note that I have simplified the description here a lot. I am trying to add such a thread (to poll some hardware registers and log their changes) to a GPU driver, so there may be module-dependent reasons for the freeze on unload. However, this does not change my general question about how to best implement a thread that never sleeps.
I think this question is similar to your other question, "whole one core dedicated to single process"; please check the replies there.
I have an embedded board with a kernel module of thousands of lines which freezes at random, under complex use cases and after random amounts of time. What are my options for trying to debug this?
I have already tried the magic SysRq keys, but they do not work. I guess the explanation is that I am stuck in a loop or a deadlock in code where hardware interrupts are disabled?
Thanks,
Eva.
Typically, embedded boards have a watchdog. You should enable this timer and use a watchdog user process to kick the watchdog hardware. Use nice on the watchdog process so that higher-priority tasks must relinquish the CPU first. This gives clues as to the issue: if the device does not reset with the watchdog active, then it may be that only the network or serial port has stopped communicating, i.e., the kernel has not locked up and the issue is just that there is no user-visible activity. The watchdog is also useful if/when this type of issue occurs in the field.
For a kernel lockup, the kernel's lockup-watchdog features may be useful. They will catch an infinite loop or deadlock, as speculated. However, if this is custom hardware, it is also possible that the SDRAM or a peripheral device latches up and causes abnormal bus activity. This will stop the CPU from fetching proper code; obviously, it is tough for Linux to recover from that.
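For example, the relevant detector options in a 3.x kernel configuration look like this (symbol names can vary between kernel versions):

CONFIG_LOCKUP_DETECTOR=y              # soft/hard lockup watchdogs
CONFIG_DETECT_HUNG_TASK=y             # detect tasks stuck in D state
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y   # panic (and thus reboot) on soft lockup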
You can combine the watchdog with some fallow memory used as a trace buffer. memmap= and mem= can limit the memory used by the kernel; a driver/device using the leftover memory can be written that saves trace points which survive a reboot. The fallow memory's ring buffer is then dumped when a watchdog reset is detected at kernel boot.
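As an illustration (addresses and sizes made up for a board with 256 MB of RAM), either of these boot arguments keeps the kernel away from the top 1 MB, which a trace-buffer driver can then map for itself:

mem=255M                  # kernel manages only the first 255 MB
memmap=1M$0x0ff00000      # or: explicitly mark the top 1 MB reserved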
It is also useful to register thread notifiers that printk on context switches, if the issue is repeatable, or to discover how to make the event repeatable. Once you determine a sequence of events that leads to the lockup, you can use a scope or logic analyzer for the final diagnosis; or it may already be evident which peripheral is the issue at that point.
You may also set panic=-1 and reboot=... on the kernel command line. The kdump facilities are useful, if you only have a code problem.
Related: kernel trap (at web archive). That link may no longer be available, but it isn't essential to this answer.