Debugging a micro-processor - debugging

One of our co-processors is an 8-bit microprocessor. It's main role is to control the hardware that handles flash memory. We suspect that the code it's running is highly inefficient since we measured low speeds when reading/writing to flash memory. The problem is, we have only one J-TAG port that's connected to the main CPU so debugging it is not an option. What we do have, is a register that's available from CPU that contains the micro-processor's program counter. The bad news, is that the micro-processor works at a different frequency than the CPU so monitoring it's program counter outside is also hard. Measuring time inside the micro-processor is also very difficult since it's registers are only 8-bit long. Needless to say, the code is in assembly and very complex. How would you go about approaching this problem?

Needless to say, the code is in assembly and very complex. How would you go about approaching this problem?
I would advise that you start from (or generate) the requirements specification for this part and reimplement the code in C (or even careful use of a C++ subset). If the "complexity" you perceive is merely down the the code rather than the requirements it would be a good idea to design that out - it will only make maintenance in the future more complex, error prone and expensive.
One of the common arguments for using assembler are size and performance, but more frequently a large body of assembler code is far from optimal; in order to retain a level of productivity and maintainability often "boiler-plate" code is used and reused that is not tailored to the specific situation, whereas a compiler will analyse code changes and perform the kind of "micro-optimisation" that system designers really shouldn't have to sweat about. Make your algorithms and data structures efficient and leave the target instruction set details to the compiler.
Even without the ability to directly debug on the target, the use of a high-level language will allow prototyping and simulation on a PC for example.
Even if you retain the assembler code, if your development tools include an instruction set simulator, that may be a good alternative to hardware debugging; especially if it supports debugger scripts that can be used to simulate the behaviour of hardware devices.
All that said, looking at this as a "black-box" and concluding that the code is inefficient is a bit of a leap. What kind of flash memory is appearing to be slow for example? How is it interfaced to the microcontroller? And how have you measured this performance? Flash memory is intrinsically slow - especially writing and page erase; check the performance specification of the Flash before drawing any conclusion on the software performance.

Related

Stall all memory accesses of an application

I want to analyze the effect of using slower memories on applications and need a mean to add delay for all memory accesses. Until now I investigated Intel PIN and other software but they seem to be overkill for what I need. Is there any tool to do so?
Is adding NOP operations in the binary code of the application right before each LOAD/STORE a feasible way?
Your best bet is to run your application under an x86 simulator such as MARSSx86 or Sniper. Using these simulators you can smoothly vary the modeled memory latency or any other parameters of the system1 and see how your application performance varies. This is a common approach in academia (often a generic machine will be modeled, rather than x86, which gives you access to more simulator implementations).
The primary disadvantage of using a simulator is that even good simulators are not completely accurate, and how accurate they are depends on the code in question. Certain types of variance from actual performance aren't particularly problematic when answering the question "how does performance vary with latency", but a simulator that doesn't model the memory access path well might produce a answer that is far from reality.
If you really can't use simulation, you could use any binary re-writing tool like PIN to instrument the memory access locations. nop would be a bad choice because it executes very quickly because you cannot add a dependency between the memory load result and the nop instruction. This latter issue means it only adds additional "work" at the location of each load, but that the work is independent of the load itself, so doesn't simulate increased memory latency.
A better approach would be to follow each load with a long latency operation that uses the result of the load as input and output (but doesn't modify it). Maybe something like imul reg, reg, 1 if reg received the result of the load (but this only adds 3 cycles, so you might hunt for longer latency instructions if you want to add a lot of latency).
1 At least within the set of things modeled by the simulator.

ARM NEON: Tools to predict performance issues due to memory access limited bandwidth?

I am trying to optimize critical parts of a C code for image processing in ARM devices and recently discovered NEON.
Having read tips here and there, I am getting pretty nice results, but there is something that escapes me. I see that overall performance is very much dependant on memory accesses and how they are done.
Which is the simplest way (by simple I mean, if possible, not having to run the whole compiled code in an emulator or simulator, but something that can be feed of small pieces of assembly and analyze them), in order to get an idea of how memory accesses are "bottlenecking" the subroutine?
I know this can not be done exactly without running it in a specific hardware and specific conditions, but the purpose is to have a "comparison" trial-and error tool to experiment with, even if the results are only approximations.
(something similar to this great tool for cycle counting)
I think you've probably answered your own question. Memory is a system level effect and many ARM implementers (Apple, Samsung, Qualcomm, etc) implement the system differently with different results.
However, of course you can optimize things for a certain system and it will probably work well on others, so really it comes down to figuring out a way that you can quickly iterate and test/simulate system level effects. This does get complicated so you might pay some money for system level simulators such as is included in ARM's RealView. Or I might recommend getting some open source hardware like a Panda Board and using valgrind's cache-grind. With linux on the panda board you can write some scripts to automate your testing.
It can be a hassle to get this going but if optimizing for ARM will be part of your professional life, then it's worth the (relatively low compared to your salary) software/hardware investment and time.
Note 1: I recommend against using PLD. This is very system tuning dependent, and if you get it working well on one ARM implementation it may hurt you for the next generation of chip or a different implementation. This may be a hint that trying to optimize at the system level, other than some basic data localization and ordering stuff may not be worth your efforts? (See Stephen's comment below).
Memory access is one thing that simply cannot be modeled from "small pieces of assembly” to generate meaningful guidance. Cache hierarchies, store buffers, load miss queues, cache policy, etc … even relatively simple processors have an enormous amount of “state” hiding underneath the LSU, and any small-scale analysis cannot accurately capture that state. That said, there are a few basic guidelines for getting the best performance:
maximize the ratio of "useful computation” instructions to LSU operations.
align your memory accesses (ideally to 16B).
if you need to pick between aligning loads or aligning stores, align your stores.
try to write out complete cachelines when possible.
PLD is mainly useful for non-uniform-but-somehow-still-predictable memory access patterns (these are rare).
For NEON specifically, you should prefer to use the vld1 and vst1 instructions (with an alignment hint). On most micro-architectures, in most cases, they are the fastest way to move between NEON and memory. Eschew v[ld|st][3|4] in particular; these are an attractive nuisance, slower than doing separate permutes on most micro-architectures in most cases.

What types of code domains is OpenCL suited to?

I read the OpenCL overview, and it states it is suitable for code that runs of CPUs, GPGPUs, DSPs, etc. However, from looking through the command reference, it seems to be all math and image type operations. I didn't see anything for say strings.
This makes me wonder what would you run on a CPU via OpenCL?
Further, I know OpenCL can be used to perform sorting on GPGPUs. But would one ever use it (or, for that matter, a current GPGPU) to perform string processing such as pattern matching, metaphone extraction, dictionary lookup, or anything else that requires the processing of arrays of strings.
EDIT
I noticed that Intel's upcoming Ivy Bridge is touted as "OpenCL compliant" with reference to its graphics units. Does this infer that the CPU cores are not OpenCL compliant, or is there no such inference?
EDIT
In the interests of non-debate and constructiveness, I would appreciate if anyone could point me to official references that would answer my question.
You can think of OpenCL as a combination of a runtime (for device discovery, queueing) and a C-based programming language. This programming language has native vector types and built-in functions and operations for doing all sorts fun stuff to these vectors. This is nice in that you can write a vectorized kernel in OpenCL, and it it the responsibility of the implementation to map that to the actual vector ISA of your hardware.
From this 4/2011 article, which might vanish:
There are two major CPU architectures out there, x86 and ARM, both of
which should soon run OpenCL code.
If you write an OpenCL application that targets both of these architectures, you wouldn't have to worry about writing two versions, one SSE and one NEON. Just write OpenCL C and be done with it. Yes, I know. This assumes the vendor has done his job and written a solid implementation that fully utilizes the underlying ISA. But if he doesn't, complain!
In addition, some CL implementations offer auto-vectorization of scalar kernels, which are usually easier to write. A good auto-vectorizer would give you a solid performance increase for no effort. Since CL kernels are compiled "online," obtaining such a benefit wouldn't require shipping rebuilt code.
No links, but I would assume this is because algorithms that use strings may do a lot of dynamic memory allocation and branching, both of which GPGPUs are not well-suited for. GPGPUs also have a lot in common with vector processing, so doing units of work with different sized blocks of memory (which a string algorithm will generally work on, you usually don't have a homogeneous group of strings), yields poorer performance and is hard to program.
GPUs were designed to do the same work, with little to no branching, on a homogeneous group of data (such as per-vector or per-pixel operations). Algorithms that can mimic this type of behavior are great on GPUs.
This makes me wonder what would you run on a CPU via OpenCL?
I prefer to use ocl to offload work from the cpu to my graphics hardware. Sometimes there is a limitation with my video card, so I like having a backup kernel for cpu use. Such limitations can be memory size, memory bottleneck, low clock speed, or when the pci-e bus gets in the way.
I say I like using a separate kernel for cpu, because I think all kernels should be tweaked to run on their target hardware. I even like to have an openmp backup plan, as most algorithms I use get tested out in this manner ahead of time.
I suppose it is best practice to test out a gpu kernel on the cpu to make sure it runs as expected. If a user of your software has opencl installed, but only a cpu (or a low-end gpu) it's nice to be able to execute the same code on the different devices.

Power Efficient Software Coding

In a typical handheld/portable embedded system device Battery life is a major concern in design of H/W, S/W and the features the device can support. From the Software programming perspective, one is aware of MIPS, Memory(Data and Program) optimized code.
I am aware of the H/W Deep sleep mode, Standby mode that are used to clock the hardware at lower Cycles or turn of the clock entirel to some unused circutis to save power, but i am looking for some ideas from that point of view:
Wherein my code is running and it needs to keep executing, given this how can I write the code "power" efficiently so as to consume minimum watts?
Are there any special programming constructs, data structures, control structures which i should look at to achieve minimum power consumption for a given functionality.
Are there any s/w high level design considerations which one should keep in mind at time of code structure design, or during low level design to make the code as power efficient(Least power consuming) as possible?
Like 1800 INFORMATION said, avoid polling; subscribe to events and wait for them to happen
Update window content only when necessary - let the system decide when to redraw it
When updating window content, ensure your code recreates as little of the invalid region as possible
With quick code the CPU goes back to deep sleep mode faster and there's a better chance that such code stays in L1 cache
Operate on small data at one time so data stays in caches as well
Ensure that your application doesn't do any unnecessary action when in background
Make your software not only power efficient, but also power aware - update graphics less often when on battery, disable animations, less hard drive thrashing
And read some other guidelines. ;)
Recently a series of posts called "Optimizing Software Applications for Power", started appearing on Intel Software Blogs. May be of some use for x86 developers.
Zeroith, use a fully static machine that can stop when idle. You can't beat zero Hz.
First up, switch to a tickless operating system scheduler. Waking up every millisecend or so wastes power. If you can't, consider slowing the scheduler interrupt instead.
Secondly, ensure your idle thread is a power save, wait for next interrupt instruction.
You can do this in the sort of under-regulated "userland" most small devices have.
Thirdly, if you have to poll or perform user confidence activities like updating the UI,
sleep, do it, and get back to sleep.
Don't trust GUI frameworks that you haven't checked for "sleep and spin" kind of code.
Especially the event timer you may be tempted to use for #2.
Block a thread on read instead of polling with select()/epoll()/ WaitForMultipleObjects().
Puts stress on the thread scheuler ( and your brain) but the devices generally do okay.
This ends up changing your high-level design a bit; it gets tidier!.
A main loop that polls all the things you Might do ends up slow and wasteful on CPU, but does guarantee performance. ( Guaranteed to be slow)
Cache results, lazily create things. Users expect the device to be slow so don't disappoint them. Less running is better. Run as little as you can get away with.
Separate threads can be killed off when you stop needing them.
Try to get more memory than you need, then you can insert into more than one hashtable and save ever searching. This is a direct tradeoff if the memory is DRAM.
Look at a realtime-ier system than you think you might need. It saves time (sic) later.
They cope better with threading too.
Do not poll. Use events and other OS primitives to wait for notifiable occurrences. Polling ensures that the CPU will stay active and use more battery life.
From my work using smart phones, the best way I have found of preserving battery life is to ensure that everything you do not need for your program to function at that specific point is disabled.
For example, only switch Bluetooth on when you need it, similarly the phone capabilities, turn the screen brightness down when it isn't needed, turn the volume down, etc.
The power used by these functions will generally far outweigh the power used by your code.
To avoid polling is a good suggestion.
A microprocessor's power consumption is roughly proportional to its clock frequency, and to the square of its supply voltage. If you have the possibility to adjust these from software, that could save some power. Also, turning off the parts of the processor that you don't need (e.g. floating-point unit) may help, but this very much depends on your platform. In any case, you need a way to measure the actual power consumption of your processor, so that you can find out what works and what not. Just like speed optimizations, power optimizations need to be carefully profiled.
Consider using the network interfaces the least you can. You might want to gather information and send it out in bursts instead of constantly send it.
Look at what your compiler generates, particularly for hot areas of code.
If you have low priority intermittent operations, don't use specific timers to wake up to deal with them, but deal with when processing other events.
Use logic to avoid stupid scenarios where your app might go to sleep for 10 ms and then have to wake up again for the next event. For the kind of platform mentioned it shouldn't matter if both events are processed at the same time.
Having your own timer & callback mechanism might be appropriate for this kind of decision making. The trade off is in code complexity and maintenance vs. likely power savings.
Simply put, do as little as possible.
Well, to the extent that your code can execute entirely in the processor cache, you'll have less bus activity and save power. To the extent that your program is small enough to fit code+data entirely in the cache, you get that benefit "for free". OTOH, if your program is too big, and you can divide your programs into modules that are more or less independent of the other, you might get some power saving by dividing it into separate programs. (I suppose it's also possible to make a toolchain that spreas out related bundles of code and data into cache-sized chunks...)
I suppose that, theoretically, you can save some amount of unnecessary work by reducing the number of pointer dereferencing, and by refactoring your jumps so that the most likely jumps are taken first -- but that's not realistic to do as a programmer.
Transmeta had the idea of letting the machine do some instruction optimization on-the-fly to save power... But that didn't seem to help enough... And look where that got them.
Set unused memory or flash to 0xFF not 0x00. This is certainly true for flash and eeprom, not sure about s or d ram. For the proms there is an inversion so a 0 is stored as a 1 and takes more energy, a 1 is stored as a zero and takes less. This is why you read 0xFFs after erasing a block.
Rather timely this, article on Hackaday today about measuring power consumption of various commands:
Hackaday: the-effect-of-code-on-power-consumption
Aside from that:
- Interrupts are your friends
- Polling / wait() aren't your friends
- Do as little as possible
- make your code as small/efficient as possible
- Turn off as many modules, pins, peripherals as possible in the micro
- Run as slowly as possible
- If the micro has settings for pin drive strengh, slew rate, etc. check them & configure them, the defaults are often full power / max speed.
- returning to the article above, go back and measure the power & see if you can drop it by altering things.
also something that is not trivial to do is reduce precision of the mathematical operations, go for the smallest dataset available and if available by your development environment pack data and aggregate operations.
knuth books could give you all the variant of specific algorithms you need to save memory or cpu, or going with reduced precision minimizing the rounding errors
also, spent some time checking for all the embedded device api - for example most symbian phones could do audio encoding via a specialized hardware
Do your work as quickly as possible, and then go to some idle state waiting for interrupts (or events) to happen. Try to make the code run out of cache with as little external memory traffic as possible.
On Linux, install powertop to see how often which piece of software wakes up the CPU. And follow the various tips that the powertop site links to, some of which are probably applicable to non-Linux, too.
http://www.lesswatts.org/projects/powertop/
Choose efficient algorithms that are quick and have small basic blocks and minimal memory accesses.
Understand the cache size and functional units of your processor.
Don't access memory. Don't use objects or garbage collection or any other high level constructs if they expands your working code or data set outside the available cache. If you know the cache size and associativity, lay out the entire working data set you will need in low power mode and fit it all into the dcache (forget some of the "proper" coding practices that scatter the data around in separate objects or data structures if that causes cache trashing). Same with all the subroutines. Put your working code set all in one module if necessary to stripe it all in the icache. If the processor has multiple levels of cache, try to fit in the lowest level of instruction or data cache possible. Don't use floating point unit or any other instructions that may power up any other optional functional units unless you can make a good case that use of these instructions significantly shortens the time that the CPU is out of sleep mode.
etc.
Don't poll, sleep
Avoid using power hungry areas of the chip when possible. For example multipliers are power hungry, if you can shift and add you can save some Joules (as long as you don't do so much shifting and adding that actually the multiplier is a win!)
If you are really serious,l get a power-aware debugger, which can correlate power usage with your source code. Like this

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU specific
instructions within a DLL?
Slightly longer version:
When downloading (32bit) DLLs from, say, Microsoft it seems that one size fits all processors.
Does this mean that they are strictly built for the lowest common denominator (ie. the
minimum platform supported by the OS)?
Or is there some technique that is used to export a single interface within the DLL but utilise
CPU specific code behind the scenes to get optimal performance? And if so, how is it done?
I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor
The DLL is expected to work on every computer WIN32 runs on, so you are stuck to the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used basically is as follows:
- determine CPU features like MMX, SSE in runtime
- if they are present, use them, if not, have fallback code ready
Because you cannot let your compiler optimise for anything else than i386, you will have to write the code using the specific instruction sets in inline assembler. I don't know if there are higher-language toolkits for this. Determining the CPU features is straight forward, but could also need to be done in assembler.
An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about fallback--there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.
Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it. (OK, you get two cakes - your DLL will be bigger). And even MSVC2005 can do it for very specific cases (E.g. memcpy() can use SSE4)
There are many ways to switch between different versions. A DLL is loaded, because the loading process needs functions from it. Function names are converted into addresses. One solution is to let this lookup depend on not just function name, but also processor features. Another method uses the fact that the name to address function uses a table of pointers in an interim step; you can switch out the entire table. Or you could even have a branch inside critical functions; so foo() calls foo__sse4 when that's faster.
DLLs you download from Microsoft are targeted for the generic x86 architecture for the simple reason that it has to work across all the multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if it has changed) Microsoft used to optimize its DLLs for size rather than speed. This is because the reduction in the overall size of the DLL gave a higher performance boost than any other optimization that the compiler could generate. This is because speed ups from micro optimization would be decidedly low compared to speed ups from not having the CPU wait for the memory. True improvements in speed come from reducing I/O or from improving the base algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro optimizations simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall in this category. You could rest assured that such critical loops would already be optimized in assembler by the Microsoft software engineers to some level and not leave much behind for the compiler to find. (I know it's expecting too much but I hope they do this)
As you can see, there would be only drawbacks from the increased DLL code that includes additional versions of code that are tuned for different architectures when most of this code is rarely used / are never part of the critical code that consumes most of your CPU cycles.

Resources