How can I benchmark or profile an embedded ARM platform emulated? - performance

I'm developing performance-sensitive code for an embedded platform. In general, there are multiple ways to test for an embedded platform, and I'm doing so by developing on a full Linux machine, using the qemu-user arm mode as an emulator. I have full unit tests working, and now want to address performance.
I'd like to profile or benchmark my code. Now, doing so directly in qemu-user is silly, because a fast op may be emulated slowly. But, in principle, qemu could tell me how many clock cycles were emulated to run a function. Even if this doesn't have a full model, or even a partial model, of cache, mem latency, etc., it will still be very useful.
Is there a way I can use qemu to tell me some sense of how code A will perform vs code B? If not, is there another tool? (I recall Intel having some type of model which will tell you how fast given asm will execute.) In general, in the absence of an embedded platform with profiling tools, how can I benchmark and profile my code for ultimate performance?

Related

Debugging an ARM assembly (Neon extension)

I am developing an algorithm that uses ARM Neon instructions. I am writing the code using assembler file (.S and no inline asm).
My question is: what is the best way to debug it, i.e. to view registers, memory, etc.?
Currently, I am using Android NDK to compile and my Android phone to run the algorithm.
Poor man's debug solutions...
You can use gdb / gdbserver to remotely control execution of applications on an Android phone. I'm not giving full details here because they change all the time, but for example you can start with this answer or do a quick search on the Internet. GDB might seem to have a steep learning curve, but the material on the web is extensive. You can easily find something to your taste.
Single-stepping an ARM core via software tools is hard; that's why the ARM ecosystem is full of expensive tools and extra HW equipment.
A trick I use is to insert BRK instructions manually in the assembly code. BRK is a self-hosted debug breakpoint. When the core sees this instruction, it stops and informs the OS. The OS then notifies the debugger and passes control to it. Once the debugger has control, you can check the contents of registers and probably even change them. The last part of the operation is to make your process continue. Since the PC still points at the breakpoint instruction, you must advance the PC to the instruction after the BRK.
Since you mentioned you use .S files instead of .s files, you can use gcc to do preprocessing / macro work. This way, enabling and disabling BRK might become less of an issue.
The big downside of this way of working is turnaround time. If there is a certain point that you want to investigate with gdb, you must make sure there is a BRK instruction there, and this will probably require another build/push/debug cycle.
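As a rough illustration of the preprocessor idea above, here is a minimal C sketch of a breakpoint macro that compiles to nothing unless a build flag is set. The flag name DEBUG_BRK, the macro names, and the sum_u8 routine are all hypothetical, made up for this example. The same #if / #else pattern should carry over to a .S file, since gcc runs the C preprocessor on those.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DEBUG_BRK is a hypothetical build flag (e.g. -DDEBUG_BRK), not a toolchain feature. */
#if defined(DEBUG_BRK) && defined(__aarch64__)
#define DEBUG_BREAK() __asm__ volatile("brk #0")
#define DEBUG_BREAK_ENABLED 1
#else
#define DEBUG_BREAK() ((void)0)   /* expands to a no-op in normal builds */
#define DEBUG_BREAK_ENABLED 0
#endif

/* Hypothetical routine standing in for the code under investigation. */
uint32_t sum_u8(const uint8_t *data, uint32_t len)
{
    uint32_t total = 0;

    assert(data != NULL || len == 0);
    DEBUG_BREAK();                /* the debugger stops here when enabled */
    for (uint32_t i = 0; i < len; ++i) {
        total += data[i];
    }
    return total;
}
```

With the flag off, the routine behaves exactly as if the macro were not there, so breakpoints can stay in the source between debug sessions. This does not remove the rebuild/push cycle when a breakpoint has to move.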

Test app performance by making it lag

Is there a way or an application to test performance by making the app execute slower? I want to be sure that my app will perform well on older hardware.
Just adding stalls in SW won't necessarily imitate any older HW; it would just show you how the stalled code behaves on the new HW (and if the stalls aren't properly serializing, they may actually get avoided altogether).
If you just want to see how the code behaves without some specific ISA features, you can disable them at compile time, or even compile for an older architecture. That won't make your CPU run any slower, of course, but the code won't be able to use, for example, AVX/SSE vectors (on x86) or other dedicated instructions.
If you want an old system+OS configuration, you can use emulation, e.g. DOSBox.
If you want an even higher level of realism, you can find a HW simulator that models that HW, and run on that (assuming you can cross-compile your code to run on it).
And of course, if you want an even more realistic experiment, and are willing to go the extra mile, just get a specimen of that old HW, wipe the dust off, and build and run on it :)

Feasibility of using the same code on both embedded and Windows platforms

We have a program written in VBA that is running on Windows machines.
We have a very similar program written in ANSI C, using a Keil IDE and compiler that is running on an STR9x uP.
Our plans were to rewrite the VBA code in .NET using C#.
What is the feasibility of writing the shared code in C++ to be used on both systems? Obviously, the .NET framework would be off limits, but that isn't much of a concern. I'm wondering, specifically, about how labor intensive you think the compilation process might be.
This is kind of a theoretical question, I know, but thanks for any thoughts.
I do this as a general practice. I think a better question than "is it possible" is "how should I structure my code to be able to run on both an embedded system and also a PC".
I prefer to write the code in C and structure each file like a C++ class, using static variables so that what would otherwise be global variables stay private to the module. Create getter and setter functions to access the private variables. Also use function pointers, set at initialization of the module, for the functions the module needs to call outside of itself.
It is also easy to refactor from the above structured c code to a class in c# or c++.
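A minimal sketch of that file-as-class structure, under my own assumptions: the thermostat module, its limits, and every name below are invented for illustration and do not come from the poster's code. The fake_heater function at the end is a PC-side stand-in for whatever the real target would register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signature of the outside function this module calls; supplied at init. */
typedef void (*thermostat_heater_fn)(int on);

/* File-scope statics: private to this module, like private class members. */
static int32_t setpoint_mc = 20000;            /* millidegrees Celsius */
static thermostat_heater_fn heater_cb = NULL;

/* "Constructor": stores the callback and resets the private state. */
void thermostat_init(thermostat_heater_fn heater)
{
    assert(heater != NULL);
    heater_cb = heater;
    setpoint_mc = 20000;
}

/* Getter for the private setpoint. */
int32_t thermostat_get_setpoint(void)
{
    return setpoint_mc;
}

/* Setter: clamps to a hypothetical 5..30 degree range. */
void thermostat_set_setpoint(int32_t mc)
{
    if (mc < 5000) mc = 5000;
    if (mc > 30000) mc = 30000;
    setpoint_mc = mc;
}

/* "Method": turns the heater on when the reading is below the setpoint. */
void thermostat_update(int32_t measured_mc)
{
    assert(heater_cb != NULL);
    heater_cb(measured_mc < setpoint_mc);
}

/* PC-side stand-in for the real heater driver; records the last request. */
int fake_heater_last = -1;
void fake_heater(int on)
{
    fake_heater_last = on;
}
```

On a PC, a test would call thermostat_init(fake_heater) and then inspect fake_heater_last. On the target, the same module would be initialised with the real driver function instead.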
You can also use C++ directly but using it incorrectly on an embedded system can cause problems.
You will need a hardware abstraction layer if you are accessing any hardware. I separate my code into two types: code that has no reference to what it is running on, and code that touches the hardware, which I refer to as drivers.
I use this code for reusing modules for things like communication protocols. But more importantly I use it for testing. I like to use gtest to unit test the modules. I can also rewrite the drivers and simulate the hardware on a PC to be able to run it on the PC.
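To make the driver split concrete, here is a small sketch of a portable module talking to a serial driver through an interface, plus a simulated driver for the PC. The frame layout, the serial_driver type, and the sim_* names are assumptions made up for this example, not a real protocol or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Driver interface: the only thing portable code knows about the hardware. */
typedef struct {
    void (*write_byte)(uint8_t byte);
} serial_driver;

/* Portable module: sends [0x7E][len][payload...][checksum] over any driver. */
void protocol_send(const serial_driver *drv, const uint8_t *payload, uint8_t len)
{
    uint8_t checksum = 0;

    assert(drv != NULL && drv->write_byte != NULL);
    assert(payload != NULL || len == 0);

    drv->write_byte(0x7E);                 /* start-of-frame marker */
    drv->write_byte(len);
    for (uint8_t i = 0; i < len; ++i) {
        drv->write_byte(payload[i]);
        checksum = (uint8_t)(checksum + payload[i]);
    }
    drv->write_byte(checksum);             /* 8-bit sum of the payload */
}

/* PC-side simulated driver: records bytes instead of touching a UART. */
uint8_t sim_tx_log[64];
size_t sim_tx_count = 0;

void sim_write_byte(uint8_t byte)
{
    if (sim_tx_count < sizeof sim_tx_log) {
        sim_tx_log[sim_tx_count++] = byte;
    }
}

const serial_driver sim_serial = { sim_write_byte };
```

A unit test on the PC can call protocol_send(&sim_serial, ...) and compare sim_tx_log against the expected frame. The embedded build would supply a serial_driver whose write_byte pokes the real peripheral.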
"Obviously, the .NET framework would be off limits"
Not necessarily true. Given sufficient ROM and RAM resources (256K/64K respectively), the .NET Micro Framework will run on your device. However, that is not necessarily a good reason to use it; there are already two other commonly used portable languages available for both your embedded target and Windows: C and C++. The target resources required for both C and C++ are minimal. C/C++ runtime start-up code can be well under 1K of code, so almost all available resources can be utilised by your application code rather than the run-time environment.
The trick to utilising common code on both platforms is abstraction. This will involve at least hardware abstraction and possibly OS abstraction if your target is using any sort of kernel or scheduler such as an RTOS or thread library.
I'd recommend designing your embedded target with a layered architecture, having at least a device layer and an application layer and, as mentioned already, possibly a system layer that deals with IPC, synchronisation and scheduling, if used. You may have other higher-layer interfaces such as networking or filesystem that would equally benefit from abstraction. Note that standard APIs such as BSD sockets or stdio already count as abstraction, so if your target uses these, you have less work to do in Windows (minor differences between BSD Sockets and Winsock may still need some work).
The application layer will have no OS or hardware dependencies other than those accessible through the device and system layers. You must then implement the device and system layers on Windows as either a simulation or remapping to services or devices available on Windows. Some RTOS's already include Windows simulators for test and development, but defining your own OS API layer that you can port between a number of native RTOS and GPOS will allow your application code to be ported to different targets for both simulation and real-time execution very quickly.
Where the platform differences are minor and localised, and may not justify an abstraction layer, then target specific conditional compilation may be appropriate. Compilers support predefined macros for architecture, OS or compiler specific code that can be used for both this localised code and to make the abstraction layer code itself common where there is significant similarity.
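A small sketch of such localised conditional compilation follows. _WIN32, __arm__ and __aarch64__ are predefined by common compilers for those targets, but which macros your particular toolchain defines is an assumption to check against its documentation. PLATFORM_NAME and the two functions are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Select a platform tag from compiler-predefined macros. */
#if defined(_WIN32)
#define PLATFORM_NAME "windows"
#elif defined(__arm__) || defined(__aarch64__)
#define PLATFORM_NAME "arm"
#else
#define PLATFORM_NAME "generic"
#endif

/* Returns the tag chosen at compile time. */
const char *platform_name(void)
{
    return PLATFORM_NAME;
}

/* Returns non-zero when the build matches the given tag. */
int platform_is(const char *name)
{
    assert(name != NULL);
    return strcmp(name, PLATFORM_NAME) == 0;
}
```

Keeping the #if chain in one place and exposing the result through ordinary functions or macros tends to keep the rest of the code free of scattered platform checks.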

How are operating systems debugged?

How are operating systems typically debugged? They cannot be stepped through with a debugger like simple console programs, and the build times are too large to repeatedly make small changes and recompile the whole thing.
They aren't debugged as multi-gigabyte programs! :)
If you mean the individual user-mode components, they can mainly be debugged just like normal programs and libraries (because they are normal programs/libraries!).
For kernel-mode components, though, each OS has its own mechanism; here is some information regarding the way that we do kernel debugging in Windows. It can be done with the help of another machine connected to the machine you're debugging, via a serial port or something similar. I'm not familiar with the process itself, but that's the gist of how it works. (You need to set some boot loader options so that the system is ready for the debugger to be connected as early as possible.)
It depends on which part of the operating system you're talking about. When I worked at MSFT, I worked on the IE team. We debugged IE and the shell (Windows Explorer) in Visual Studio and stepped through them line by line all day long. Though, sometimes, it's easier to debug using a command line tool such as NTSD.
If, however, you want to debug anything in Kernel land such as the OS kernel or device drivers, which I suspect is really what you're asking, then you must use the Kernel debugger. For Windows that is a command line tool called kd, and generally you run the debugger on one machine and remotely debug the target.
There is a whole set of techniques throughout history, from flashing lights on the console, to hardware devices like an ICE, to more modern techniques using fairly standard debuggers. One technique that is more common among OS developers than application developers is the analysis of a core dump. Look at something like mdb on Solaris for ideas about how Solaris kernel developers do some of their debugging. Tracing technologies are also used, anywhere from fairly straightforward logging packages to more modern techniques like dtrace.
Also note that the techniques used depend on the layer of software. Initial boot tends to be a fairly hard place to get your fingers into. But after that, the environment of modern operating systems looks more and more like the application setting you are used to. In the end, it is all code :)

Debugging an Operating System

I was going through some general material about operating systems and hit upon a question. How does a developer debug while developing an operating system, i.e. debug the OS itself? What tools are available to the OS developer for debugging?
Debugging a kernel is hard, because you probably can't rely on the crashing machine to communicate what's going on. Furthermore, the buggy code is probably in scary places like interrupt handlers.
There are four primary methods of debugging an operating system of which I'm aware:
Sanity checks, together with output to the screen.
Kernel panics on Linux (known as "Oops"es) are a great example of this. The Linux folks wrote a function that would print out what they could find out (including a stack trace) and then stop everything.
Even warnings are useful. Linux has guards set up for situations where you might accidentally go to sleep in an interrupt handler. The mutex_lock function, for instance, will check (in might_sleep) whether you're in an unsafe context and print a stack trace if you are.
Debuggers
Traditionally, under debugging, everything a computer does is output over a serial line to a stable test machine. With the advent of virtual machines, you can now wire one VM's execution serial line to another program on the same physical machine, which is super convenient. Naturally, however, this requires that your operating system publish what it is doing and wait for a debugger connection. KGDB (Linux) and WinDBG (Windows) are some such in-OS debuggers. VMWare supports this story explicitly.
More recently the VM developers out there have figured out how to debug a kernel without either a serial line or kernel extensions. VMWare has implemented this in their recent stuff.
The problem with debugging in an operating system is (in my mind) related to the Uncertainty principle. Interrupts (where most of your hard errors are sure to be) are asynchronous, frequent and nondeterministic. If your bug relates to the overlapping of two interrupts in a particular way, you will not expose it with a debugger; the bug probably won't even happen. That said, it might, and then a debugger might be useful.
Deterministic Replay
When you get a bug that only seems to appear in production, you wish you could record what happened and replay it, like a security camera. Thanks to a professor I knew at Illinois, you can now do this in a VMWare virtual machine. VMWare and related folks describe it all better than I can, and they provide what looks like good documentation.
Deterministic replay is brand new on the scene, so thus far I'm unaware of any particularly idiomatic uses. They say it should be particularly useful for security bugs, too.
Moving everything to User Space.
In the end, things are still more brittle in the kernel, so there's a tremendous development advantage to following the Nucleus (or Microkernel) design, where you shave the kernel-mode components to their bare minimum. For everything else, you can use the myriad of user-space dev tools out there, and you'll be much happier. FUSE, a user-space filesystem extension, is the canonical example of this.
I like this last idea, because it's like you wrote the program to be writeable. Cyclic, no?
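To illustrate the first method above (sanity checks plus output), here is a user-space sketch loosely modelled on the might_sleep idea. It is not the Linux implementation: the atomic-depth counter, MIGHT_SLEEP macro, and toy_mutex_lock are invented for this example, and a real kernel would print a stack trace rather than a single line.

```c
#include <assert.h>
#include <stdio.h>

/* Private state: how deep we are in "must not sleep" code, and warnings seen. */
static int atomic_depth = 0;
static unsigned warning_count = 0;

/* Mark entry to and exit from a context where sleeping is forbidden. */
void enter_atomic(void)
{
    ++atomic_depth;
}

void leave_atomic(void)
{
    assert(atomic_depth > 0);
    --atomic_depth;
}

/* Number of violations reported so far. */
unsigned sanity_warning_count(void)
{
    return warning_count;
}

/* The check itself: report the call site if sleeping is not allowed here. */
void might_sleep_check(const char *file, int line)
{
    if (atomic_depth > 0) {
        ++warning_count;
        fprintf(stderr, "BUG: sleeping call in atomic context at %s:%d\n",
                file, line);
    }
}

#define MIGHT_SLEEP() might_sleep_check(__FILE__, __LINE__)

/* A function that may block announces it up front; real locking is omitted. */
void toy_mutex_lock(void)
{
    MIGHT_SLEEP();
}
```

The point of the pattern is that the warning fires on every call from a forbidden context, even when the lock happens not to block, so the bug shows up in testing rather than only under contention.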
In a bootstrap scenario (OS from scratch), you'd probably have to introduce remote debugging capabilities (memory dumping, logging, etc.) in the OS kernel early on, and use a separate machine. Or you could use a virtual machine/hypervisor.
Windows CE has a component called KITL - Kernel Independent Transport Layer. I guess the title speaks for itself.
You can use a VM: e.g. debug ring0 code with bochs/gdb
or Debugging NetBSD kernel with qemu
or a serial line with something like KDB.
printf logging
attach to process
serious unit tests
etc..
Remote debugging with kernel debuggers, which can also be done via virtualization.
Debugging an operating system is not for the faint of heart. Because the kernel itself is being debugged, your options are quite limited. Copious printf statements are one trick. Beyond that, it really depends on which part of the 'operating system' is being debugged; we could be talking about:
Filesystem
Drivers
Memory management
Raw Disk input/output
Screen input/output
Kernel
Again, it is a widely varying exercise, because the components above all interact with one another. Even more complicated: supposing you were to debug the kernel, how would you do it if the runtime environment is not properly set up (by that, I mean the kernel's responsibility for loading binary executables)?
Some kernels (not all of them) incorporate a simple debug monitor. In fact, if I recall rightly, in the book 'Developing Your Own 32-Bit Operating System' by Richard A. Burgess (Sams Publishing), the author incorporated a debug monitor which displays various states of the CPU, registers and so on.
Again, take into account the fact that binary executables require a certain loading mechanism; if the environment for loading binaries, for example a gdb equivalent, is not set up, then your options are quite limited.
Using copious printf statements to send errors, logs, etc. to a separate terminal or to a file is the best line of debugging. It does sound like a nightmare, but it is worth the effort.
Hope this helps,
Best regards,
Tom.
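The printf-style logging recommended in the answer above could look roughly like the following user-space sketch. The KLOG macro and the klog_* functions are hypothetical names. A real kernel has no stdio, so the FILE sink here stands in for whatever output channel (serial port, second terminal, file) is actually available.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Private state: where log lines go, and how many have been written. */
static FILE *log_sink = NULL;
static unsigned log_lines = 0;

/* Redirect output, e.g. to a file or a second terminal; NULL means stderr. */
void klog_set_sink(FILE *sink)
{
    log_sink = sink;
}

/* Number of log lines emitted so far. */
unsigned klog_line_count(void)
{
    return log_lines;
}

/* Writes one line prefixed with the call site, then flushes immediately
 * so the message survives a crash right after the call. */
void klog_write(const char *file, int line, const char *fmt, ...)
{
    FILE *out = log_sink ? log_sink : stderr;
    va_list args;

    assert(fmt != NULL);
    fprintf(out, "[%s:%d] ", file, line);
    va_start(args, fmt);
    vfprintf(out, fmt, args);
    va_end(args);
    fputc('\n', out);
    fflush(out);
    ++log_lines;
}

/* Convenience macro that fills in the call site automatically. */
#define KLOG(...) klog_write(__FILE__, __LINE__, __VA_ARGS__)
```

A call such as KLOG("mapped %d pages", n) then records both the message and where it came from, which helps when the log is the only evidence left after a crash.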

Resources