need help debugging an unstable program - debugging

Following some changes my Arduino sketch became unstable, it only run 1-2 hours and crashes. It's now a month that I am trying to understand but do not make sensible progress: the main difficulty is that the slightest change make it run apparently "ok" for days...
The program is ~1500 lines long
Can someone suggest how to progress?
Thanks in advance for your time

Well, the embedded systems are wery well known for continuous fight against Universe forth dimension: time. It is known that some delays must be added inside code - this does not imply the use of a sistem delay routine always - just operation order may solve a lot.
Debugging a system with such problem is difficult. Some techniques could be used:
a) invasive ones: mark (i.e. use some printf statements) in various places of your software, entry or exit of some routines or other important steps and run again - when the application crashes, you must note the last message seen and conclude the crash is after that software step marked by the printf.
b) less invasive: use an available GPIO pin as output and set it high at the entry of some routine and low at the exit; the crasing point will leave the pin either high or low. You can use several pins if available and watch the activity with an oscilloscope.
c) non invasive - use the JTAG or SWD debugging - this is the best one - if your micro support faults debugging, then you have the methods to locate the bug.

Related

Executing simple code on secondary processors, Pre-boot or in DOS

On AMD-64 (family 15h) in barebone legacy mode - such as pre-booting, or in DOS -
I wish to momentarily 'awake' each application processor in turn, have it run a (very short) sequence of instructions, and put it back to its previous waiting or sleeping state.
For specifics, I want each processor to 'WrMSR' some microcode update blob, newer than what's currently carved in BIOS. But the question can be, more generally : how to awake processor number N, set its state (CS:rIP) so it executes a prepared thread of instructions from DRAM and finally goes back to 'sleep' quietly?
This should be done with the lightest possible of machineries, in real mode as far as possible, using no "ACPI" tables, helpers, and stuff ! Also, for the kind of very basic and short tasks envisioned, the app processors would not need to serve hardware interrupts, so, I guess, no special APIC setup is needed.
I'd appreciate a sketch of the essential steps, with assembler or pseudo-code, and or pointers to relevant documentation or example code.

How to measure ISR execution time?

I am on linux kernel 2.6.32.
I am facing an issue in which one of the two ISR (serial and ethernet) are taking more time (hundreds of microseconds) on several occasion/under some scenarios which I don't know. I would like to get the time difference every time the ISR executes.
What would be the best way (least expensive in terms of overhead involved). I don't see ARM architecture has some TSC register (read_tsc api) which would give me direct access to time as it offers on some other architecture.
So Idea is
1) The moment ISR is invoked measure time
2) the moment ISR is complete measure the time.
3) get the difference of 1 and 2 store it in some variable.
4) Keep doing the steps 1 to 2 and when the value received in the step 3 is greater than the past value overwrite it (keep/preserve value with maximum latency).
When the issue happens (some abrupt condition print the value) or array of last 10 value).
I need to do in kernel driver so let me know what would be the least expensive way.
OMAP3 has Cortex-A8 core. That does have Performance Monitor Unit (PMU). Cycle Count (CCNT) would correspond to x86 TSC, except probably you have to enable it counting before you read. Good info in BeagleBoard post.
In 2.6.32.55 I see arch/arm/oprofile/op_model_v7.c gives full access and control. My need was bare-metal, I used ARM example code that was simple and worked for me.
It would also be possible to use an OMAP3 GPT, but that would be more work, e.g. to get its clock input set up from PRCM.

Drastic performance inprovement in .NET CF after app gets moved out of the foreground. Why?

I have a large (500K lines) .NET CF (C#) program, running on CE6/.NET CF 3.5 (v.3.5.10181.0). This is running on a FreeScale i.Mx31 (ARM) # 400MHz. It has 128MB RAM, with ~80MB available to applications. My app is the only significant one running (this is a dedicated, embedded system). Managed memory in use (as reported by GC.Collect) is about 18MB.
To give a better idea of the app size, here's some stats culled from .NET CF Remote Performance Monitor after staring up the application:
GC:
Garbage Collections 131
Bytes Collected by GC 97,919,260
Managed Bytes in use after GC 17,774,992
Total Bytes in use after GC 24,117,424
GC Compactions 41
JIT:
Native Bytes Jitted: 10,274,820
Loader:
Classes Loaded 7,393
Methods Loaded 27,691
Recently, I have been trying to track down a performance problem. I found that my benchmark after running the app in two different startup configurations would run in approximately 2 seconds (slow case) vs. 1 second (fast case). In the slow case, the time for the benchmark could change randomly from EXE run to EXE run from 1.1 to 2 seconds, but for any given EXE run, would not change for the life of the application. In other words, you could re-run the benchmark and the time for the test stays the same until you restart the EXE, at which point a new time is established and consistent.
I could not explain the 1.1 to 2x slowdown via any conventional mechanism, or by narrowing the slowdown to any particular part of the benchmark code. It appeared that the overall process was just running slower, almost like a thread was spinning and taking away some of "my" CPU.
Then, I randomly discovered that just by switching away from my app (the GUI loses the foreground) to another app, my performance issue disappears. It stays gone even after returning my app to the foreground. I now have a tentative workaround where my app after startup launches an auxiliary app with a 1x1 size window that kills itself after 5ms. Thus the aux app takes the foreground, then relinquishes it.
The question is, why does this speed up my application?
I know that code gets pitched when a .NET CF app loses the foreground. I also notice that when performing a "GC Heap" capture with .NET CF Remote Performance Monitor, a Code Pitch is logged -- and this also triggers the performance improvement in my app. So I suspect somehow that code pitching is related or even responsible for fixing performance. But I'm at a loss as to figure out how to determine if that is really the case, or even to explain why pitching code could help in this way. Does pitching out lots of code somehow significantly help locality of reference of code pages (that are re-JITted, presumably near each other in memory) enough to help this much? (My benchmark spans probably 3 dozen routines and hundreds of lines of code.)
Most importantly, what can I do in my app to reliably avoid this slower condition. Any pointers to relevant .NET CF / JIT / Code pitching information would be greatly appreciated.
Your app going to the background auto-triggers a GC.Collect, which collects, may compact the GC Heap and may pitch code. Have you checked to see if a manual GC.Collect without going to the background gives the same behavior? It might not be pitching that's giving the perf gain, it might be collection or compaction. If a significant number of dead roots are swept up, walking the root tree may be getting faster. Can't say I've specifically seen this issue, so this is all conjecture.
Send your app a wm_hibernate before your performance loop. Will clean up things
We have a similar issue with our .NET CF application.
Over time, our application progressively slows down, eventually to a halt with what I anticipate is due to high CPU load, or as #wil-s says, as if thread is spinning consuming CPU. The only assumption / conclusion I've made to so far is either we have a rogue thread in our code, or there's an under the cover issue in .NET CF, maybe with the JITter.
Closing the application and re-launching returns our application to normal expected performance.
I am yet to implement code change to test issuing WM_HIBERNATE or launch a dummy app which quits itself (as above) to force a code pitch, but am fairly sure this will resolve our issue based on the above comments. (so many thanks for that)
However, I'm subsequently interested to know whether a root cause was ever found to this specific issue?
Incidentally and seemingly off topic (but bear with me), we're using a Freescale i.MX28 processor and are experiencing unpredictable FlashDisk corruption. Seeing 2K blocks of 0xFFs (erased blocks) in random files located on NAND Flash.
I'm mentioning this as I now believe the CPU and FlashDisk corruption issues are linked, due to this article as well as this one:
https://electronics.stackexchange.com/questions/26720/flash-memory-corruption-due-to-electricals
In the article, #jwygralak67 comments:
I recently worked through a flash corruption issue, on a WinCE system,
as part of a development team. We would sporadically find 2K blocks of
flash that were erased. (All bytes 0xFF) For about 6 months we tested
everything from ESD, to dirty power to EMI and RFI interference, we
bought brand new devices and tracked usage to make sure we weren't
exceeding the erase cycle limit and buring out blocks, we went through
our (application level) software with a fine toothed comb.
In the end it turned out to be an obscure bug in the very low level
flash driver code, which only occurred under periods of heavy CPU
load. The driver came from a 3rd party. We informed them of the issue
we found, but I don't know if they ever released a patch.
Unfortunately, we're yet to make contact with him.
With all of this in mind, potentially if we work around the high CPU load, maybe the corruption will no longer be present. Another conjecture case!
On that assumption however, this doesn't give a firm root cause for either situation, which I'm desperately seeking!
Any knowledge or insight, however small, would be very gratefully received.
#ctacke - we've spoken before regarding OpenNETCF via email, so I'm pleased to see your name!

What simple method can I use to debug an embedded processor without serial port or video?

We have a small embedded system without any video or serial ports (i.e. we can't output text via printf).
We would like to track the progress of our code through the initialization sequence.
Is there some simple things we can do to help with this.
It is not running any OS, and the hardware platform is somewhat customizable.
The simplest most scalable solution are state LEDs. Toggle LEDs based on actions, either in binary form or when certain actions occur if you can narrow your focus.
The most powerful will be a hardware JTAG device. You don't even need to set breakpoints - simply being able to stop the application and inspect the state of memory may be enough. Note that some hardware platforms do not support "fancy" options such as memory watches or hardware breakpoints. The former is usually worked around with constantly stopping the processor and reading memory (turns your 10MHz system into a 1kHz system), while the latter is sometimes performed using code replacement (replace the targeted instruction with a different jump), which sometimes masks other problems. Be aware of these issues and which embedded processors they apply to.
There are a few strategies you can employ to help with debugging:
If you have Output Pins available, you can hook them up to LEDs (or an oscilloscope) and toggle the output pins high/low to indicate that certain points have been reached in the code.
For example, 1 blink might be program loaded, 2 blink is foozbar initialized, 3 blink is accepting inputs...
If you have multiple output lines available, you can use a 7 segment LED to convey more information (numbers/letters instead of blinks).
If you have the capabilities to read memory and have some RAM available, you can use the sprint function to do printf-like debugging, but instead of going to a screen/serial port, it is written in memory.
It depends on the type of debugging that you're trying to do - in particular if you're after a temporary method of tracing or if you are trying to provide a tool that can be used as an indication of status during the life of the project (or product).
For one off, in depth source tracing and debugging an in-circuit debugger (eg. jtag) can be very helpful. However, they are most helpful where your debugging requires setting breakpoints and investigating memory and registers - which makes it of little benefit where you are dealing time critical problems.
Where you need to determine program state without having a significant impact on the execution of your system the use of LEDs connected to spare I/O pins will be helpful. These can also be used as the input to a digital storage oscilloscope (DSO) or logic analyzer.
This technique can be made more powerful by selecting unique patterns of pulses that will be identifiable on the DSO.
For a more versatile debugging tool, though, a serial port is a good solution. To save cost and PCB real-estate you may find it useful to use an plug-in module that contains the RS232 converters.
If you are trying to provide a longer term indication of status as part of the normal operation of your product, LEDs are again a cheap an simple method. However in this situation it is best to choose patterns of pulses that are slow enough to be easily identified by visual inspection. This will all you over time you will learn a particular pattern that represents "normal" behavior.
You can easily emulate serial communications (UARTs) using bit-banging from the IO pins of the system. Hook it to one of the card's pins and attach to a RS232 converter there (TTL to RS232 converters are easy to either buy or build), which goes to your PC's serial port.
A JTAG debugger is also an option, though cumbersome to set up.
If you don't have JTAG, the LEDs suggested by the others are a great idea - although you do tend to end up in a test/rebuild cycle to try to track down the issue.
If you've got more time, and spare hardware pins, and memory to spare, you could always bit-bash a low speed serial interface. I've found that pretty useful in the past.
Others have suggested some pretty good ideas using output pins, so I won't suggest that, although it can be a very good solution, and is very cost effective. If your budget and target processor support it, a hardware trace system, (either an old fashioned emulator, or a fancy BDM with bus snooping trace support) can be great for this type of thing. It's very expensive though.
The idea of using a bit-banged software UART is nice, but there's some effort required in writing one and also you need some free timers and interrupts. If your hardware has any other unused serial interface (SPI, I2C, ..), using them would be easier. With a small microcontroller you could convert the interface to RS-232.
If you have to go for the bit-banging, making a synchronous serial might be a simpler alternative as it wouldn't be critical to timing.

Power Efficient Software Coding

In a typical handheld/portable embedded system device Battery life is a major concern in design of H/W, S/W and the features the device can support. From the Software programming perspective, one is aware of MIPS, Memory(Data and Program) optimized code.
I am aware of the H/W Deep sleep mode, Standby mode that are used to clock the hardware at lower Cycles or turn of the clock entirel to some unused circutis to save power, but i am looking for some ideas from that point of view:
Wherein my code is running and it needs to keep executing, given this how can I write the code "power" efficiently so as to consume minimum watts?
Are there any special programming constructs, data structures, control structures which i should look at to achieve minimum power consumption for a given functionality.
Are there any s/w high level design considerations which one should keep in mind at time of code structure design, or during low level design to make the code as power efficient(Least power consuming) as possible?
Like 1800 INFORMATION said, avoid polling; subscribe to events and wait for them to happen
Update window content only when necessary - let the system decide when to redraw it
When updating window content, ensure your code recreates as little of the invalid region as possible
With quick code the CPU goes back to deep sleep mode faster and there's a better chance that such code stays in L1 cache
Operate on small data at one time so data stays in caches as well
Ensure that your application doesn't do any unnecessary action when in background
Make your software not only power efficient, but also power aware - update graphics less often when on battery, disable animations, less hard drive thrashing
And read some other guidelines. ;)
Recently a series of posts called "Optimizing Software Applications for Power", started appearing on Intel Software Blogs. May be of some use for x86 developers.
Zeroith, use a fully static machine that can stop when idle. You can't beat zero Hz.
First up, switch to a tickless operating system scheduler. Waking up every millisecend or so wastes power. If you can't, consider slowing the scheduler interrupt instead.
Secondly, ensure your idle thread is a power save, wait for next interrupt instruction.
You can do this in the sort of under-regulated "userland" most small devices have.
Thirdly, if you have to poll or perform user confidence activities like updating the UI,
sleep, do it, and get back to sleep.
Don't trust GUI frameworks that you haven't checked for "sleep and spin" kind of code.
Especially the event timer you may be tempted to use for #2.
Block a thread on read instead of polling with select()/epoll()/ WaitForMultipleObjects().
Puts stress on the thread scheuler ( and your brain) but the devices generally do okay.
This ends up changing your high-level design a bit; it gets tidier!.
A main loop that polls all the things you Might do ends up slow and wasteful on CPU, but does guarantee performance. ( Guaranteed to be slow)
Cache results, lazily create things. Users expect the device to be slow so don't disappoint them. Less running is better. Run as little as you can get away with.
Separate threads can be killed off when you stop needing them.
Try to get more memory than you need, then you can insert into more than one hashtable and save ever searching. This is a direct tradeoff if the memory is DRAM.
Look at a realtime-ier system than you think you might need. It saves time (sic) later.
They cope better with threading too.
Do not poll. Use events and other OS primitives to wait for notifiable occurrences. Polling ensures that the CPU will stay active and use more battery life.
From my work using smart phones, the best way I have found of preserving battery life is to ensure that everything you do not need for your program to function at that specific point is disabled.
For example, only switch Bluetooth on when you need it, similarly the phone capabilities, turn the screen brightness down when it isn't needed, turn the volume down, etc.
The power used by these functions will generally far outweigh the power used by your code.
To avoid polling is a good suggestion.
A microprocessor's power consumption is roughly proportional to its clock frequency, and to the square of its supply voltage. If you have the possibility to adjust these from software, that could save some power. Also, turning off the parts of the processor that you don't need (e.g. floating-point unit) may help, but this very much depends on your platform. In any case, you need a way to measure the actual power consumption of your processor, so that you can find out what works and what not. Just like speed optimizations, power optimizations need to be carefully profiled.
Consider using the network interfaces the least you can. You might want to gather information and send it out in bursts instead of constantly send it.
Look at what your compiler generates, particularly for hot areas of code.
If you have low priority intermittent operations, don't use specific timers to wake up to deal with them, but deal with when processing other events.
Use logic to avoid stupid scenarios where your app might go to sleep for 10 ms and then have to wake up again for the next event. For the kind of platform mentioned it shouldn't matter if both events are processed at the same time.
Having your own timer & callback mechanism might be appropriate for this kind of decision making. The trade off is in code complexity and maintenance vs. likely power savings.
Simply put, do as little as possible.
Well, to the extent that your code can execute entirely in the processor cache, you'll have less bus activity and save power. To the extent that your program is small enough to fit code+data entirely in the cache, you get that benefit "for free". OTOH, if your program is too big, and you can divide your programs into modules that are more or less independent of the other, you might get some power saving by dividing it into separate programs. (I suppose it's also possible to make a toolchain that spreas out related bundles of code and data into cache-sized chunks...)
I suppose that, theoretically, you can save some amount of unnecessary work by reducing the number of pointer dereferencing, and by refactoring your jumps so that the most likely jumps are taken first -- but that's not realistic to do as a programmer.
Transmeta had the idea of letting the machine do some instruction optimization on-the-fly to save power... But that didn't seem to help enough... And look where that got them.
Set unused memory or flash to 0xFF not 0x00. This is certainly true for flash and eeprom, not sure about s or d ram. For the proms there is an inversion so a 0 is stored as a 1 and takes more energy, a 1 is stored as a zero and takes less. This is why you read 0xFFs after erasing a block.
Rather timely this, article on Hackaday today about measuring power consumption of various commands:
Hackaday: the-effect-of-code-on-power-consumption
Aside from that:
- Interrupts are your friends
- Polling / wait() aren't your friends
- Do as little as possible
- make your code as small/efficient as possible
- Turn off as many modules, pins, peripherals as possible in the micro
- Run as slowly as possible
- If the micro has settings for pin drive strengh, slew rate, etc. check them & configure them, the defaults are often full power / max speed.
- returning to the article above, go back and measure the power & see if you can drop it by altering things.
also something that is not trivial to do is reduce precision of the mathematical operations, go for the smallest dataset available and if available by your development environment pack data and aggregate operations.
knuth books could give you all the variant of specific algorithms you need to save memory or cpu, or going with reduced precision minimizing the rounding errors
also, spent some time checking for all the embedded device api - for example most symbian phones could do audio encoding via a specialized hardware
Do your work as quickly as possible, and then go to some idle state waiting for interrupts (or events) to happen. Try to make the code run out of cache with as little external memory traffic as possible.
On Linux, install powertop to see how often which piece of software wakes up the CPU. And follow the various tips that the powertop site links to, some of which are probably applicable to non-Linux, too.
http://www.lesswatts.org/projects/powertop/
Choose efficient algorithms that are quick and have small basic blocks and minimal memory accesses.
Understand the cache size and functional units of your processor.
Don't access memory. Don't use objects or garbage collection or any other high level constructs if they expands your working code or data set outside the available cache. If you know the cache size and associativity, lay out the entire working data set you will need in low power mode and fit it all into the dcache (forget some of the "proper" coding practices that scatter the data around in separate objects or data structures if that causes cache trashing). Same with all the subroutines. Put your working code set all in one module if necessary to stripe it all in the icache. If the processor has multiple levels of cache, try to fit in the lowest level of instruction or data cache possible. Don't use floating point unit or any other instructions that may power up any other optional functional units unless you can make a good case that use of these instructions significantly shortens the time that the CPU is out of sleep mode.
etc.
Don't poll, sleep
Avoid using power hungry areas of the chip when possible. For example multipliers are power hungry, if you can shift and add you can save some Joules (as long as you don't do so much shifting and adding that actually the multiplier is a win!)
If you are really serious,l get a power-aware debugger, which can correlate power usage with your source code. Like this

Resources