How to determine which task is dead? - debugging

I have an embedded system that has multiple (>20) tasks running at different priorities. I also have a watchdog task that runs to check that all the other tasks are not stuck. My watchdog is working because every once in a blue moon, it will reboot the system because a task did not check in.
How do I determine which task died?
I can't just blame the oldest task to kick the watchdog because it might have been held off by a higher priority task that is not yielding.
Any suggestions?

A per-task watchdog requires that the higher priority tasks yield for an adequate time so that all may kick the watchdog. To determine which task is at fault, you'll have to find the one that's starving the others. You'll need to measure task execution times between watchdog checks to locate the actual culprit.
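To make "measure task execution times between watchdog checks" concrete, here is a minimal sketch. It assumes your scheduler gives you a context-switch hook and a free-running tick counter; `on_task_switch`, `busiest_task` and `NUM_TASKS` are hypothetical names, not part of any particular RTOS.

```c
#include <stdint.h>

#define NUM_TASKS 4  /* hypothetical task count */

static uint32_t run_ticks[NUM_TASKS]; /* time each task has run in this window */
static uint32_t last_switch_time;
static int      current_task = -1;    /* -1: no task has been switched in yet */

/* Call from the scheduler's context-switch hook (hook name is hypothetical). */
void on_task_switch(int next_task, uint32_t now)
{
    if (current_task >= 0) {
        /* Unsigned subtraction stays correct across timer wrap-around. */
        run_ticks[current_task] += now - last_switch_time;
    }
    current_task = next_task;
    last_switch_time = now;
}

/* Call from the watchdog before it resets the system: returns the index of
 * the task that consumed the most CPU time in the current window. */
int busiest_task(uint32_t now)
{
    on_task_switch(current_task, now); /* charge the task that is still running */
    int worst = 0;
    for (int i = 1; i < NUM_TASKS; i++) {
        if (run_ticks[i] > run_ticks[worst]) {
            worst = i;
        }
    }
    return worst;
}

/* Call after each successful watchdog check to start a new window. */
void start_window(void)
{
    for (int i = 0; i < NUM_TASKS; i++) {
        run_ticks[i] = 0;
    }
}
```

The task that hogged the CPU shows up with the largest count, even though it is usually a different, lower-priority task that misses its check-in.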

Is this pre-emptive? I gather so since otherwise a watchdog task would not run if one of the others had gotten stuck.
You make no mention of the OS but, if a watchdog task can check if a single task has not checked in, there must be separate channels of communication between each task and the watchdog.
You'll probably have to modify the watchdog to somehow dump the task number of the one that hasn't checked in and dump the task control blocks and memory so you can do a post-mortem.
Depending on the OS, this could be easy or hard.

I spent the last few weeks working on a watchdog reset problem myself. Fortunately for me, the RAM dump files (in an ARM development environment) contain an interrupt handler trace buffer that records the PC and SLR at each interrupt. From that trace buffer I could find out exactly which part of the code was running before the watchdog reset.
I think if you have a similar mechanism that stores the PC and SLR at each interrupt, you can pinpoint the culprit task precisely.

Depending on your system and OS, there may be different approaches. One very low level approach I have used is to turn an LED on while each of the tasks is running. You may need to put a scope on the LEDs to see very fast task switching.

For an interrupt-driven watchdog, you'd just make the task switcher update the currently running task number each time it is changed, allowing you to identify which one didn't yield.
However, you suggest you wrote the watchdog as a task yourself, so before rebooting, surely the watchdog can identify the starved task? You can store this in memory that persists beyond a warm reboot, or send it over a debug interface. The problem with this is that the starved task is probably not the problematic one: you'll probably want to know the last few task switches (and times) in order to identify the cause.
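A minimal sketch of keeping "the last few task switches (and times)" could look like this. The names (`trace_task_switch`, `trace_dump`, `TRACE_LEN`) are made up for illustration, and how the buffer is placed in RAM that survives a warm reboot is toolchain-specific and only noted in a comment.

```c
#include <stdint.h>

#define TRACE_LEN 8u /* power of two so the index can be masked */

struct switch_record {
    uint8_t  task; /* task that was switched in */
    uint32_t time; /* timestamp in timer ticks */
};

/* On a real target this would live in RAM that the startup code does not
 * clear (toolchain-specific, e.g. a no-init linker section). */
static struct switch_record trace[TRACE_LEN];
static uint32_t trace_head; /* total number of records written so far */

/* Call from the task switcher each time the running task changes. */
void trace_task_switch(uint8_t task, uint32_t now)
{
    struct switch_record *r = &trace[trace_head & (TRACE_LEN - 1u)];
    r->task = task;
    r->time = now;
    trace_head++;
}

/* Copy out the most recent records, oldest first; returns how many. */
unsigned trace_dump(struct switch_record *out, unsigned max)
{
    unsigned n = trace_head < TRACE_LEN ? (unsigned)trace_head : TRACE_LEN;
    if (n > max) {
        n = max;
    }
    for (unsigned i = 0; i < n; i++) {
        out[i] = trace[(trace_head - n + i) & (TRACE_LEN - 1u)];
    }
    return n;
}
```

After the reboot (or over the debug interface just before it), dumping the buffer shows which task was switched in last and how long it had been running.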

A simplistic, back of the napkin approach would be something like this:
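volatile uint8_t wd_tickle[NUM_TASKS];  // one check-in flag per task
void taskA_main()
{
    ...
    // main loop
    while(1) {
        ...
        wd_tickle[TASKA_NUM] = 1;  // check in; unlike a counter, a flag cannot wrap back to 0
    }
}
... tasks B, C, D... follow similar pattern
void watchdog_task()
{
    for(int i = 0; i < NUM_TASKS; i++) {
        if(0 == wd_tickle[i]) {
            // Egads! The task didn't kick us! Reset and record the task number
        }
        wd_tickle[i] = 0;  // re-arm so the task must check in again next period
    }
}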

How is your system working exactly? I always use a combination of software and hardware watchdogs. Let me explain...
My example assumes you're working with a preemptive real time kernel and you have watchdog support in your cpu/microcontroller. This watchdog will perform a reset if it was not kicked within a certain period of time. You want to check two things:
1) The periodic system timer ("RTOS clock") is running (if not, functions like "sleep" would no longer work and your system is unusable).
2) All threads can run within a reasonable period of time.
My RTOS (www.lieron.be/micror2k) provides the possibility to run code in the RTOS clock interrupt handler. This is the only place where you refresh the hardware watchdog, so you're sure the clock is running all the time (if not the watchdog will reset your system).
In the idle thread (always running at lowest priority), a "software watchdog" is refreshed. This is simply setting a variable to a certain value (e.g. 1000). In the RTOS clock interrupt (where you kick the hardware watchdog), you decrement and check this value. If it reaches 0, it means that the idle thread has not run for 1000 clock ticks and you reboot the system (can be done by looping indefinitely inside the interrupt handler to let the hardware watchdog reboot).
Now for your original question. I assume the system clock keeps running, so it's the software watchdog that resets the system. In the RTOS clock interrupt handler, you can do some "statistics gathering" in case the software watchdog situation occurs. Instead of resetting the system, you can see what thread is running at each clock tick (after the problem occurs) and try to find out what's going on. It's not ideal, but it will help.
Another option is to add several software watchdogs at different priorities. Have the idle thread set VariableA to 1000 and have a (dedicated) medium priority thread set VariableB. In the RTOS clock interrupt handler, you check both variables. With this information you know whether the looping thread has a priority higher than "medium" or lower than "medium". If you wish, you can add a 3rd or 4th, or as many software watchdogs as you like. Worst case, add a software watchdog for each priority that's used (this will cost you as many extra threads, though).
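Here is a rough sketch of the multi-level software watchdog idea, written as plain C so it can be tried off-target. The names (`wd_refresh`, `wd_tick`, `WD_LEVELS`) and the three levels are assumptions for illustration; on a real system `wd_tick` would be called from the RTOS clock interrupt, and on an 8-bit CPU the 16-bit refresh would need interrupts briefly disabled.

```c
#include <stdint.h>

#define WD_LEVELS 3     /* hypothetical: 0 = idle, 1 = medium, 2 = high */
#define WD_RELOAD 1000u /* clock ticks a level may go without running */

static volatile uint16_t wd_count[WD_LEVELS] = { WD_RELOAD, WD_RELOAD, WD_RELOAD };

/* Called by the dedicated thread at each monitored priority level. */
void wd_refresh(int level)
{
    wd_count[level] = WD_RELOAD;
}

/* Called once per clock tick. Returns -1 while every level is healthy,
 * otherwise the highest starved level; the looping thread's priority
 * lies above that level. */
int wd_tick(void)
{
    int starved = -1;
    for (int i = 0; i < WD_LEVELS; i++) {
        if (wd_count[i] > 0) {
            wd_count[i]--;
        }
        if (wd_count[i] == 0) {
            starved = i;
        }
    }
    return starved;
}
```

If `wd_tick` reports level 1 while level 2 is still being refreshed, the thread that is hogging the CPU runs at a priority between the medium and high monitor threads.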

Related

STM32F407VG Standby mode wake up reason — WUTF flag always set

I’m writing a low power application for the STM32F407VG. It goes into standby mode and can wake up in two ways:
Periodically, using the RTC wakeup timer;
By pressing a push-button connected to the PA0-WKUP pin.
Depending on whether the application was woken up by the RTC or the push-button, I need to perform two different tasks. Therefore, when the firmware resets after waking up from standby mode, I must figure out the wakeup reason (RTC or push-button).
I’ve made the necessary configurations to wake up from Standby mode from either source, and they’re working — the processor does wake up periodically, or when I hit the push-button. The issue is with figuring out the wakeup reason.
The documentation for the RTC_ISR register’s WUTF states the following:
Bit 10 WUTF: Wakeup timer flag
This flag is set by hardware when the wakeup auto-reload counter
reaches 0.
This flag is cleared by software by writing 0.
This flag must be cleared by software at least 1.5 RTCCLK periods before WUTF is
set to 1 again.
This seems perfect to me — if the flag is set, it must be because the wakeup timer reached 0 and woke up the processor.
I inserted some code at the beginning of my firmware to read WUTF and set an LED according to it, and then clear the flag immediately after that. Unfortunately, this flag is always set, not only when waking up from Standby mode due to the RTC, but also when waking up due to the push-button, and even when powering on the circuit for the first time.
I checked the errata sheet for this MCU and found no mention of this issue.
I do realize a workaround would be to read the status of the push-button, and if it corresponds to the pressed state, assume the wakeup reason is due to the push-button being pressed. However, my firmware runs for only a couple of microseconds in Run mode before going back into Standby mode, and due to bouncing issues with the push-button, this kind of detection is not reliable unless I stretch out the Run mode time to several milliseconds. This in turn impacts the average power consumption of my application (and therefore battery life). While adding a capacitor might help, I’d like to implement a software-only solution if possible.
It was entirely my bad. I was reading the flag through the following HAL macro:
__HAL_RTC_WAKEUPTIMER_GET_FLAG(&hRTC, RTC_FLAG_WUTF);
It turns out I was using it before initializing hRTC.Instance, so rather than accessing the RTC's registers, it was just reading some random memory (probably address 0). After fixing it, the flag appears to work reliably.
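For what it's worth, once the flag reads correctly the decision logic can be kept as a pure function of the raw register values, which makes it easy to test off-target. This is only a sketch: the WUTF position (bit 10 of RTC_ISR) comes from the documentation excerpt quoted in the question, while the SBF position in PWR_CSR is my assumption and should be checked against the reference manual. The enum and function names are made up.

```c
#include <stdint.h>

/* Bit 10 of RTC_ISR is WUTF, per the documentation excerpt in the question. */
#define RTC_ISR_WUTF_BIT (1u << 10)
/* Assumption: bit 1 of PWR_CSR is SBF (set when the MCU resumed from Standby).
 * Verify against the reference manual for your part. */
#define PWR_CSR_SBF_BIT  (1u << 1)

enum wake_reason { WAKE_NOT_STANDBY, WAKE_RTC_TIMER, WAKE_PUSH_BUTTON };

/* On the MCU the arguments would be the raw values read from the PWR_CSR
 * and RTC_ISR registers, before either flag is cleared. */
enum wake_reason decode_wake_reason(uint32_t pwr_csr, uint32_t rtc_isr)
{
    if ((pwr_csr & PWR_CSR_SBF_BIT) == 0) {
        return WAKE_NOT_STANDBY; /* power-on or another kind of reset */
    }
    if (rtc_isr & RTC_ISR_WUTF_BIT) {
        return WAKE_RTC_TIMER;
    }
    return WAKE_PUSH_BUTTON; /* the only other wake source configured here */
}
```

Reading the registers directly also sidesteps the problem of an uninitialized HAL handle, since no handle is involved.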

Triggering a software event from an interrupt (XMEGA, GCC)

I want to run a periodic "housekeeping" event, triggered regularly by a timer interrupt. The interrupt fires frequently (kHz+), while the function may take a long time to finish, so I can't simply have it executed in line.
In the past, I've done this on an ATMEGA, where an ISR can simply permit other interrupts to fire (including itself again) with sei(). By wrapping the event in a "still executing" flag, it won't pile up on the stack and cause a... you know:
if (!inFunction) { inFunction = true; doFunction(); inFunction = false; }
I don't think this can be done -- at least as easily -- on the XMEGA, due to the PMIC interrupt controller. It appears the interrupt flags can only be reset by executing RETI.
So, I was thinking, it would be convenient if I could convince GCC to produce a tail call out of an interrupt. That would immediately execute the event, while clearing interrupts.
This would be easy enough to do in assembler, just push the address and RETI. (Well, some stack-mangling because ISR, but, yeah.) But I'm guessing it'll be a hack in GCC, possibly a custom ASM wrapper around a "naked" function?
Alternately, I would love to simply set a low priority software interrupt, but I don't see an intentional way to do this.
I could use software to trigger an interrupt from an otherwise unused peripheral. That's fine as a special case, but then, if I ever need to use that device, I have to find another. It's bad for code reuse, too.
Really, this is an X-Y problem and I know it. I think I want to do X, but really I need method Y that I just don't know about.
One better method is to set a flag, then let main() deal with it when it gets around to it. Unfortunately, I have blocking functions in main() (handling user input via serial), so that would take work, and be a mess.
The only "proper" method I know of offhand, is to do a full task switch -- but damned if I'm going to effectively implement an RTOS, or pull one in, just for this. There's got to be a better way.
Have I actually covered all the possibilities, and painted myself into a corner? Do I have to compromise and choose one of these? Am I missing anything better?
There are more possibilities to solve this.
1. Enable your timer interrupt as low priority. In this way the medium and high priority interrupts will be able to interrupt this low priority interrupt, and run unaffected.
This is similar to using sei(); in your interrupt handler in older processors (without PMIC).
2.a Set a flag (variable) in the interrupt. Poll the flag in the main loop. If the flag is set, clear it and do your stuff.
2.b Set up the timer but don't enable its interrupt. Poll the OVF interrupt flag of your timer in the main loop. If the flag is set, clear it and do your stuff.
These are timed less accurately, depending on what else the main loop does, so it comes down to your expectations for accuracy. For handling more tasks in the main loop without an OS, look at cooperative multitasking and state machines.
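A minimal sketch of option 2.a, written so that it compiles on a PC as well. The function names are hypothetical; on the XMEGA the body of `timer_isr_body` would sit inside the actual timer ISR.

```c
#include <stdbool.h>
#include <stdint.h>

static volatile bool housekeeping_due; /* set in the ISR, cleared in the main loop */
static uint32_t housekeeping_runs;     /* stands in for the real work */

/* Body of the timer ISR: keep it short and just record the event. */
void timer_isr_body(void)
{
    housekeeping_due = true;
}

static void do_housekeeping(void)
{
    housekeeping_runs++;
}

/* Call on every pass of the main loop. Returns true if the job ran. */
bool poll_housekeeping(void)
{
    if (!housekeeping_due) {
        return false;
    }
    housekeeping_due = false; /* clear first so a tick during the job is not lost */
    do_housekeeping();
    return true;
}
```

Note that several timer ticks arriving before the next poll collapse into a single run of the job, which is usually what you want for housekeeping.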

how to figure out if process is really running or waiting to run on Linux?

This is a purely academic question; I don't really need this information for anything, but I would like to understand the kernel a bit more :-)
According to kernel documentation http://www.tldp.org/LDP/tlk/kernel/processes.html processes in linux kernel have following states:
Running
The process is either running (it is the current process in the
system) or it is ready to run (it is waiting to be assigned to one of
the system's CPUs).
Waiting
The process is waiting for an event or for a resource. Linux
differentiates between two types of waiting process; interruptible and
uninterruptible. Interruptible waiting processes can be interrupted by
signals whereas uninterruptible waiting processes are waiting directly
on hardware conditions and cannot be interrupted under any
circumstances.
Stopped
The process has been stopped, usually by receiving a signal. A process
that is being debugged can be in a stopped state.
Zombie
This is a halted process which, for some reason, still has a
task_struct data structure in the task vector. It is what it sounds
like, a dead process.
As you can see, when I take a snapshot of processes state, using command like ps, I can see, if it's in Running state, that process either was literally Running or just waiting to be assigned to some CPU by kernel.
In my opinion, these 2 states (that are actually both represented by 1 state in task_struct) are quite different.
Why is there no state like "Ready" that would mean the process is "ready to run" but hasn't been assigned to any CPU so far, so that the task_struct would be clearer about the real state? Is it even possible to retrieve this information, or is it secret for whatever reason which process is "literally running" on the CPU?
The struct task_struct contains a long to represent current state:
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
This simply indicates if a process is 'runnable'.
To see the currently executing process you should look at the runqueue. Specifically a struct rq (as defined in kernel/sched/sched.h) contains:
struct task_struct *curr, *idle, *stop;
The pointer *curr is the currently running process on this runqueue (there exists a runqueue per CPU).
You should consult files under kernel/sched/ to see how the Kernel determines which processes should be scheduled according to the different scheduling algorithms if you are interested in exactly how it arrives at the running state.
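From user space, the state that tools like ps report can be read from /proc/<pid>/stat, and there the letter R covers both "on a CPU" and "waiting on a runqueue", matching the single runnable state above. Here is a small sketch of extracting that letter from one line of the file; `stat_state` is a made-up helper name, and the sample line in the usage note is invented.

```c
#include <string.h>

/* Extract the state letter from a /proc/<pid>/stat line, whose layout is
 * "pid (comm) state ...". The comm field may itself contain spaces or
 * parentheses, so look for the LAST ')' instead of splitting on spaces.
 * Returns 0 if the line is malformed. */
char stat_state(const char *line)
{
    const char *close = strrchr(line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0') {
        return 0;
    }
    return close[2];
}
```

For example, `stat_state("1234 (my proc) R 1 1234")` yields 'R', but that alone cannot tell you whether the task was on a CPU at that instant.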
This is not a linux-kernel answer but a more general one about scheduling ^^
A core part of any OS is the Scheduler: http://en.wikipedia.org/wiki/Process_scheduler
Many of them work by giving every process a time slice of execution and letting each of them do a little bit of work before switching (referred to as a context switch) to another process.
Since the length of a time slice is on the order of milliseconds, by the time the information you requested is shown the state has almost surely changed, so differentiating between "Really Running" and "Ready-but-not-really-running" would (most of the time) give you inaccurate information.

How do I increase windows interrupt latency to stress test a driver?

I have a driver & device that seem to misbehave when the user does any number of complex things (opening large word documents, opening lots of files at once, etc.) -- but does not reliably go wrong when any one thing is repeated. I believe it's because it does not handle high interrupt latency situations gracefully.
Is there a reliable way to increase interrupt latency on Windows XP to test this theory?
I'd prefer to write my test program in python, but c++ & WinAPI is also fine...
My apologies for not having a concrete answer, but an idea to explore would be to use either c++ or cython to hook into the timer interrupt (the clock tick one) and waste time in there. This will effectively increase latency.
I don't know if there's an existing solution. But you may create your own one.
On Windows all the interrupts are prioritized. So if some driver code is running at a high IRQL, your driver won't be able to service its interrupt if its level is lower. At least it won't be able to run on the same processor.
I'd do the following:
Configure your driver to run on a single processor (don't remember how to do this, but such an option definitely exists).
Add an I/O control code to your driver.
In your driver's Dispatch routine do a busy wait on a high IRQL (more about this later)
Call your driver (via DeviceIoControl) to simulate a stress.
The busy wait may look something like this:
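KIRQL oldIrql;
__int64 t1, t2;
KeRaiseIrql(31, &oldIrql);                 // mask lower-level interrupts on this processor
KeQuerySystemTime((LARGE_INTEGER*) &t1);   // system time in 100-nanosecond units
while (1)
{
    KeQuerySystemTime((LARGE_INTEGER*) &t2);
    if (t2 - t1 > /* put the needed time interval */)   // elapsed time since the wait began
        break;
}
KeLowerIrql(oldIrql);                      // restore the previous IRQL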

How to wait for one second on an 8051 microcontroller?

I'm supposed to write a program that will send some values to registers, then wait one second, then change the values. The thing is, I'm unable to find the instruction that will halt operations for one second.
How about setting up a timer interrupt?
Some useful hints and code snippets in this Keil 8051 application note.
There is no such 'instruction'. There is however no doubt at least one hardware timer peripheral (the exact peripheral set depends on the exact part you are using). Get out the datasheet/user manual for the part you are using and figure out how to program the timer; you can then poll it or use interrupts. Typically you'd configure the timer to generate a periodic interrupt that then increments a counter variable.
Two things you must know about timer interrupts: Firstly, if your counter variable is greater than 8-bit, access to it will not be atomic, so outside of the interrupt context you must either temporarily disable interrupts to read it, or read it twice in succession with the same value to validate it. Secondly, the timer counter variable must be declared volatile to prevent the compiler optimising out access to it; this is true of all variables shared between interrupts and threads.
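The "read it twice" technique can be sketched like this, assuming a 16-bit tick counter on an 8-bit CPU. `tick_count`, `timer_tick_isr` and `read_ticks` are illustrative names, and how the handler is attached to the timer interrupt is compiler-specific and left out.

```c
#include <stdint.h>

/* volatile: shared between the interrupt handler and the main code. */
static volatile uint16_t tick_count;

/* Body of the periodic timer interrupt handler. */
void timer_tick_isr(void)
{
    tick_count++;
}

/* Read the 16-bit counter without disabling interrupts: if the interrupt
 * fires between the two byte accesses of one read, the two reads differ,
 * so repeat until two consecutive reads agree. */
uint16_t read_ticks(void)
{
    uint16_t a, b;
    do {
        a = tick_count;
        b = tick_count;
    } while (a != b);
    return a;
}
```

The loop almost always finishes on the first pass; it only repeats when a tick lands in the middle of a read.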
Another alternative is to use a low power 'sleep' mode if supported; you set up a timer to wake the processor after the desired period, and issue the necessary sleep instruction (this may be provided as an 'intrinsic' by your compiler, or it may be controlled by a peripheral register). This is general advice, not 8051 specific; I don't know if your part even supports a sleep mode.
Either way you need to wade through the part specific documentation. If you could tell us the exact part, you may get help with that.
A third solution is to use an 8051 specific RTOS kernel which will provide exactly the periodic delay function you are looking for, as well as multi-threading and IPC.
I would set up a timer so that it interrupts every 10ms. In that interrupt, increment a variable.
You will also need to write a function to disable interrupts and read that variable.
In your main program, you will read the timer variable and then wait until it is 100 more than it was when you started (100 ticks of 10ms each is one second).
Don't forget to watch out for the timer variable rolling over.
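One way to handle the rollover is to compare elapsed ticks with unsigned arithmetic instead of comparing absolute values. A small sketch, assuming a 16-bit counter and a 10ms tick, so one second is 1 s / 10 ms = 100 ticks; `ticks_elapsed` and `TICKS_PER_SECOND` are illustrative names.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICKS_PER_SECOND 100u /* 1 s / 10 ms per tick */

/* True once at least `ticks` timer ticks have passed since `start`.
 * The difference is truncated to uint16_t, so the result stays correct
 * when the counter rolls over from 65535 to 0. */
bool ticks_elapsed(uint16_t start, uint16_t now, uint16_t ticks)
{
    return (uint16_t)(now - start) >= ticks;
}
```

In the main program you would record the start value once, then loop until `ticks_elapsed(start, current_value, TICKS_PER_SECOND)` becomes true, reading the current value with interrupts disabled as described above. This works for any delay shorter than one full counter period.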
