What is the best way to trace interrupts all the way from ring3 to ring0?
For example, for the clock interrupt:
I want to see all the called functions starting from function in the interrupted user-mode process down to scheduler_tick().
I can do it manually by running gdb with QEMU, however it's quite cumbersome.
Perhaps ftrace is what you want.
It allows you to keep track of the kernel function calls. You have to manually set all the functions that you want to trace. Then, the kernel will keep track of those function in a buffer that you can read later.
Related
I'm making an emulation driver that requires me to call schedule() in ATOMIC contexts in order to make the emulation part work. For now I have this hack that allows me to call schedule() inside ATOMIC (e.g. spinlock) context:
int p_count = current_thread_info()->preempt_count;
current_thread_info()->preempt_count = 0;
schedule();
current_thread_info()->preempt_count = p_count;
But that doesn't work inside IRQs, the system just stops afer calling schedule().
Is there any way to hack the kernel in a way to allow me to do it? I'm using Linux kernel 4.2.1 with User Mode Linux
In kernel code you can be either in interrupt context or in process context.
When you are in interrupt context, you cannot call any blocking function (e.g., schedule()) or access the current pointer. That's related to how the kernel is designed and there is no way for having such functionalities in interrupt context. See also this answer.
Depending on what is your purpose, you can find some strategy that allows you to reach your goal. To me, it sounds strange that you have to call schedule() explicitly instead of relying on the natural kernel flow.
One possible approach follows (but, again, it depends on your specific goal). Form the IRQ you can schedule the work on a work queue through schedule_work(). The work queue, in fact, by design, executes kernel code in process context. From there, you are allowed to call blocking functions and access the current process data.
When we use irq_set_chained_handler the irq line will not be disabled or not, when we are servicing the associated handler, as in case of request_irq.
It doesn't matter how the interrupt was setup. When any interrupt occurred, all interrupts (for this CPU) will be disabled during the interrupt handler. For example, on ARM architecture first place in C code where interrupt handling is found is asm_do_IRQ() function (defined in arch/arm/kernel/irq.c). It's being called from assembler code. For any interrupt (whether it was requested by request_irq() or by irq_set_chained_handler()) the same asm_do_IRQ() function is called, and interrupts are disabled automatically by ARM CPU. See this answer for details.
Historical notes
Also, it worth to be mentioned that some time ago Linux kernel was providing two types of interrupts: "fast" and "slow" ones. Fast interrupts (when using IRQF_DISABLED or SA_INTERRUPT flag) were running with disabled interrupts, and those handlers supposed to be very short and quick. Slow interrupts, on the other hand, were running with re-enabled interrupts, because handlers for slow interrupts may take much of time to be handled.
On modern versions of Linux kernel all interrupts are considered as "fast" and are running with interrupts disabled. Interrupts with huge handlers must be implemented as threaded (or enable interrupts manually in ISR using local_irq_enable_in_hardirq()).
That behavior was changed in Linux kernel v2.6.35 by this commit. You can find more details about this here.
Refer https://www.kernel.org/doc/Documentation/gpio/driver.txt
This means the GPIO irqchip is registered using
irq_set_chained_handler() or the corresponding
gpiochip_set_chained_irqchip() helper function, and the GPIO irqchip
handler will be called immediately from the parent irqchip, while
holding the IRQs disabled. The GPIO irqchip will then end up calling
something like this sequence in its interrupt handler:
In the Linux kernel, when a breakpoint I register with register_wide_hw_breakpoint is triggered, the callback handler endlessly runs until the breakpoint is unregistered.
Background: To test a driver for some hardware we are making, I am writing a second kernel module that emulates the hardware interface. My intent is to set a watchpoint on a memory location that in the hardware would be a control register, so that writing to this 'register' can trigger an operation by the emulator driver.
See here for a complete sample.
I set the breakpoint as follows:
hw_breakpoint_init(&attr);
attr.bp_addr = kallsyms_lookup_name("test_value");
attr.bp_len = HW_BREAKPOINT_LEN_4;
attr.bp_type = HW_BREAKPOINT_W;
test_hbp = register_wide_hw_breakpoint(&attr, test_hbp_handler, NULL);
but when test_value is written to, the callback (test_hbp_handler) is triggered continually without control ever returning to the code that was about to write to test_value.
1) What should I be doing differently for this to work as expected (return execution to code that triggered breakpoint)?
2) How do I capture the value that was being written to the memory location?
In case this matters:
$ uname -a
Linux socfpga-cyclone5 3.10.37-ltsi-rt37-05714-ge4ee387 #1 SMP PREEMPT RT Mon Jan 5 17:51:35 UTC 2015 armv7l GNU/Linux
This is by design. When an ARM hardware watchpoint is hit, it generates a Data Abort exception. On ARM, Data Abort exceptions trigger before the instruction that triggers them finishes1. This means that, in the exception handler, registers and memory locations affected by the instruction still hold their old values (or, in some cases, undefined values). As such, when the handler finishes, it must retry the aborted instruction so that the interrupted program runs as intended2. If the watchpoint is still set when the handler returns, the instruction will trigger it again. This causes the loop you're seeing.
To get around this, userspace debuggers like GDB single-step over any instruction that hits a watchpoint with that watchpoint disabled before resuming execution. The underlying kernel API, however, just exposes the hardware watchpoint behavior directly. Since you're using the kernel API, it's up to your event handler to ensure that the watchpoint doesn't trigger on the retried instruction.
[The ARM watchpoint code in the kernel actually does support automatic single-step, but only under very specific conditions. Namely, it requires 1) that no event handler is registered to the watchpoint, 2) that the watchpoint is in userspace, and 3) that the watchpoint is not associated with a particular CPU. Since your use case violates at least (1) and (2), you have to find another solution.]3
Unfortunately, on ARM, there's no foolproof way to keep the watchpoint enabled without causing a loop. The breakpoint mode that GDB uses to single-step programs, "instruction mismatch," generates UNPREDICTABLE behaviour when used in kernel mode4. The best you can do is disable the watchpoint in your handler and then set a standard breakpoint to re-enable it on an instruction that you know will execute soon after.
For your MMIO emulation driver, watchpoints are probably not the answer. In addition to the issues just mentioned, most ARM cores have very few watchpoint registers, so the solution would not scale. I'm afraid I'm not familiar enough with ARM's memory model to suggest an alternative approach. However, Linux's existing code for emulating memory-mapped IO for virtual machines might be a good place to start.
1There are two types of Data Abort exceptions, synchronous and asynchronous, and it's left to the implementation to decide which one a watchpoint generates. I'm describing the behavior of synchronous exceptions in this answer, because that's what would cause the problem you're having.
2ARMv7-A/R Architecture Reference Manual, B1.9.8, "Data Abort exception."
3Linux Kernel v4.6, arch/arm/kernel/hw_breakpoint.c, lines 634-661.
4ARMv7-A/R Architecture Reference Manual, C3.3.3, "UNPREDICTABLE cases when Monitor debug-mode is selected."
I have an embedded board with a kernel module of thousands of lines which freeze on random and complexe use case with random time. What are the solution for me to try to debug it ?
I have already try magic System Request but it does not work. I guess that the explanation is that I am in a loop or a deadlock in a code where hardware interrupt is disable ?
Thanks,
Eva.
Typically, embedded boards have a watch dog. You should enable this timer and use the watchdog user process to kick the watch dog hard ware. Use nice on the watchdog process so that higher priority tasks must relinquish the CPU. This gives clues as to the issue. If the device does not reset with a watch dog active, then it maybe that only the network or serial port has stopped communicating. Ie, the kernel has not locked up. The issue is that there is no user visible activity. The watch dog is also useful if/when this type of issue occurs in the field.
For a kernel lockup case, the lockup watchdogs kernel features maybe useful. This will work if you have an infinite loop/deadlock as speculated. However, if this is custom hardware, it is also possible that SDRAM or a peripheral device latches up and causes abnormal bus activity. This will stop the CPU from fetching proper code; obviously, it is tough for Linux to recover from this.
You can combine the watchdog with some fallow memory that is used as a trace buffer. memmap= and mem= can limit the memory used by the kernel. A driver/device using this memory can be written that saves trace points that survive a reboot. The fallow memory's ring buffer is dumped when a watchdog reset is detected on kernel boot.
It is also useful to register thread notifiers that can do a printk on context switches, if the issue is repeatable or to discover how to make the event repeatable. Once you determine a sequence of events that leads to the lockup, you can use the scope or logic analyzer to do some final diagnosis. Or, it maybe evident which peripheral is the issue at this point.
You may also set panic=-1 and reboot=... on the kernel command line. The kdump facilities are useful, if you only have a code problem.
Related: kernel trap (at web archive). This link may no longer be available, but aren't important to this answer.
I am actually reading Windows Internals 5th edition and i am enjoying, although isn't a easy book to read and understand.
I am confused about IRQLs and IDT Table.
I read that windows implement custom priorization levels with IRQL and the Plug and Play Manager maps IRQ from devices to IRQL.
Alright, so, IRQLs are used for Software and Hardware interrupts, and for exceptions is used the Exception Dispatch handler.
When one device generates an interrupt, the interrupt controller pass this information to the CPU with the IRQ.
So Windows takes this IRQ and translates to IRQL to schedule when to execute the routine (routine that IDT[IRQ_VALUE] is pointing to?
Is that what is happening?
Yes, on a very high level.
Everything starts with a kernel trap. Kernel trap handler handles interrupts, exceptions, system service calls and virtual memory pager.
When an interrupt happens (line based - using dedicated pin or message based- writing to an address) windows uses IRQL to determine the priority of the interrupt and uses this to see if the interrupt can be served or not during that time. HAL does the job of translating the IRQ to IRQL.
It then uses IRQ to get an index of the IDT to find the appropriate ISR routing to invoke. Note there can be multiple ISR associated for a given IRQ. All of them execute in order.
Each processor has its own IDT so you could potentially have multiple ISR's running at the same time.
Exception dispatch, as I mentioned before, is also handled by the kernel trap but the procedure for it is different. It usually starts by checking for any exception handlers by stack unwinding, then checking for debugger port etc.