I'm looking for some information about bare-metal programming.
I'm working on different PowerPC platforms, and currently trying to prove that some tests are not impacted by the Linux kernel. These tests are pretty basic: loads and stores in asm volatile, plus some benchmarks (CoreMark, Dhrystone, etc.). They run perfectly on Linux, but I now have to run them bare-metal, an environment I don't really have experience in.
All my platforms have U-Boot installed, and I'm wondering whether there are applications that would let me run my powerpc-eabi cross-compiled tests. For example, would a gdbserver launched by U-Boot be able to communicate over the serial port or Ethernet? Is it possible to have a BusyBox called by U-Boot?
U-Boot is a bootloader; use it. You probably have an X-modem or Y-modem downloader with U-Boot, and if push comes to shove you can turn your program into a long series of write-word-to-memory commands followed by a branch to that address.
U-Boot will have already set up RAM and the serial port; that is how you are talking to U-Boot in the first place, so you don't have to do any of that. You won't need to configure the serial port, but you will want to find out how to write a character, which means polling the status register until the transmit register is empty and then writing one character to it. Repeat for every character in your string or whatever you want to print.
The bootstrap to your C program (assuming it is C) usually involves, at a bare minimum, setting up the stack pointer and then branching to your C entry point. Since U-Boot is running, the stack is already set up, so you can skip that step as long as you load your program where it doesn't collide with what U-Boot is doing.
Depending on how you have written your high-level-language program (I am assuming C), you might have to zero out the .bss area and set up the .data area. The nice thing about using a bootloader to copy a program to RAM and just run it is that you usually don't have to do any of this: the binary you download and run already has .bss zeroed and .data in the right place. So it comes back to set up the stack and branch, or simply branch, since you may not even have to set up the stack.
Building a bare-metal program is the real challenge, because you don't have a system to make system calls to, and that is a hard thing to give up and/or simulate. Newlib, for example, makes life a bit easier, as it has a very easy-to-replace system backend so that you can, for example, leave the printfs in Dhrystone (vs. removing them and finding a different way to output the strings or results).
Compiling the C files to objects is easy, and assembling the assembly is easy; you should be able to do both with your powerpc-eabi GCC cross compiler. The next challenge is linking: telling the linker where stuff goes. Since this is likely a flat chunk of RAM, you can probably do something like -Ttext 0x123450000, where the number is the base address of the RAM you want to use. If you have any multiplies or divides, any floats, any other gcc library functions (which replace things your processor may or may not do, or wrap them so they are done properly), or any libc calls, the linker will try to pull them in. Ideally the gcc library ones are easy, but depending on the cross compiler they can be a challenge; worst case, take the gcc sources and build those functions yourself, or get or build a different gcc cross compiler with different target options (generally an easy thing to do).
I highly recommend you disassemble your binary and make sure, if nothing else, that the entry point of your bootstrap is at the beginning of the binary. Use objcopy to make a binary file: powerpc-...-objcopy myprog.elf -O binary myprog.bin. Then use X-modem or Y-modem at the U-Boot prompt to copy over that program and run it.
Backing up: from the datasheets for the part, look up the UART and figure out the base address, then first use the U-Boot prompt to write to the address of the UART transmit register. Write a 0x30 to that address, for example; if you have the right address, then before U-Boot prints its prompt again after your command it should emit an extra zero ('0'). If you can't get it to do that with a single write from the U-Boot command line, you won't get it to work in a program of any kind; you have the wrong address or you are doing something else wrong.
Then write a very small program in assembly language that outputs a character to the UART by writing to that address, then counts to some big number depending on the speed of your processor. If you are running at 100 MHz, count to 100 million or more (or count down to zero from a few hundred million), then branch to the beginning and repeat: output, wait, output, wait. Build and link this tiny program, download it with X-modem or whatever, and branch to it. If you can't get it to output a character every few seconds, you won't be able to progress to something more complicated.
Next small program: poll the status register, wait for the TX buffer to be empty, then write a 0x30 to the TX register. Increment the register holding the 0x30 to 0x31, and AND that register with 0x37 so it wraps. Branch back to the wait-for-TX-empty and output the new value, 0x31; make this an infinite loop. If, once it starts running, you don't see 0123456701234567... repeated forever with the numbers never getting mangled (they must be 0-7 and then repeat), you won't be able to progress to something more complicated.
Repeat the last two programs in C with a small bootstrap that branches to the C entry point. If you can't get those working, you won't be able to progress any further.
Start small with any library calls you think you can't do without (printf, for example); if you can't make a simple printf("Hello World\n"); work with all the linking and system backend and such, then you won't be able to run Dhrystone and leave in its system calls.
The compiler will likely turn some of Dhrystone into memcpy or memset calls, which you will have to implement. There are most likely hand-tuned assembly versions of these, and your Dhrystone numbers can and will be hugely affected by the implementation of functions like these, so you can't simply do this:
void memset ( unsigned char *d, unsigned char c, unsigned int len)
{
    while(len--) *(d++)=c;
}
and expect any performance. You can likely grab the gcc library or glibc versions of these, or just steal the ones from the Linux build of one of these tests (disassemble and grab the asm); that way you have apples to apples...
Benchmarking is often more bogus than real. It is very easy to take the same benchmark source with the same compiler in the same environment (on Linux or on bare metal, etc.) and show dramatically different results by doing various simple things: different compiler options, rearranging the functions, adding a few nops in the bootstrap, etc.; anything to either build different code or take advantage of (or get hurt by) the cache.
If you want to show bare metal being faster than the operating system, it is likely NOT going to happen without a bit of work. You are going to need to get the I and D caches up; the D cache likely requires that you get the MMU up, and so on. These can all be research projects. Then you need to know how to control your compiler build: make sure optimizations are on and, as mentioned, add or remove nops in your bootstrap to change the alignment of tight loops with respect to cache lines. On an operating system there are interrupts and things going on, and possibly you are multitasking, so with bare metal you should be able to get Dhrystone-like tests to run at the same speed or faster than on Linux. If you can't, it is not because Linux is faster; it is because you are not doing something right in your bare-metal implementation.
Yes, you can probably use gdb to talk to U-Boot and load programs; I'm not sure, as I never use gdb. I prefer a dumb terminal and X- or Y-modem, or JTAG with the OpenOCD terminal (telnet into OpenOCD rather than gdb).
You could try compiling the benchmarks together with U-Boot, so that after U-Boot finishes loading it runs your program. I know that was possible on ARM platforms.
I don't know whether toolchains exist for PowerPC bare-metal development.
At https://cirosantilli.com/linux-kernel-module-cheat/#dhrystone in this commit I have provided a minimal runnable Dhrystone baremetal example with Newlib on ARM that runs on QEMU and gem5. With this starting point, it should not be hard to port it to PowerPC or other ISAs and real platforms.
In that setup, Newlib implements everything except syscalls themselves as described at: https://electronics.stackexchange.com/questions/223929/c-standard-libraries-on-bare-metal/400077#400077 which makes it much easier to use larger subsets of the C standard library.
And I use newlib through a toolchain built with crosstool-NG.
Some key files in that setup:
linker script
syscall implementations
the full make command showing some of the flags used:
make \
-j 8 \
-C /home/ciro/bak/git/linux-kernel-module-cheat/submodules/dhrystone \
CC=/home/ciro/bak/git/linux-kernel-module-cheat/out/crosstool-ng/build/default/install/aarch64/bin/aarch64-unknown-elf-gcc \
'CFLAGS_EXTRA=-nostartfiles -O0' \
'LDFLAGS_EXTRA=-Wl,--section-start=.text=0x40000000 -T /home/ciro/bak/git/linux-kernel-module-cheat/baremetal/link.ld' \
'EXTRA_OBJS=/home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/lib/bootloader.o /home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/lib/lkmc.o /home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/lib/syscalls_asm.o /home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/lib/syscalls.o' \
OUT_DIR=/home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/submodules/dhrystone \
-B \
;
Related: How to compile dhrystone benchmark for RV32I
I just started diving into the world of operating systems, and I've learned that processes have a certain memory space they can address, which is handled by the operating system. I don't quite understand how an operating system written in high-level languages like C and C++ can obtain this kind of memory-management functionality.
You have caught the bug and there is no cure for it :-)
The language you use to write your OS has very little to do with the way your OS operates. Yes, most people use C/C++, but there are others. You do need a language that lets you directly communicate with the hardware you plan to manage, assembly being the main choice for this part. However, this is less than 5% of the whole project.
The code that you write must not rely upon any existing operating system; i.e., you must code all of the functions yourself, or call existing libraries that are themselves written so that they don't rely upon anything else.
Once you have a base, you can write your OS in any language you choose, with the minor part in assembly for the things a high-level language won't allow. In fact, in 64-bit code, some compilers no longer allow inline assembly, so this makes the 5% I mentioned above more like 15%.
Find out what you would like to do and then find out if that can be done in the language of choice. For example, the main operating system components can be written in C, while the actual processor management (interrupts, etc) must be done in assembly. Your boot code must be in assembly as well, at least most of it.
As mentioned in a different post, I have some early example code that you might want to look at. The boot is done in assembly, while the loader code, both Legacy BIOS and EFI, are mostly C code.
To clarify fysnet's answer, the reason you have to use at least a bit of assembly is that you can only explicitly access addressable memory in C/C++ (through pointers), while hardware registers (such as the program counter or stack pointer) often don't have memory addresses. Not only that, but some registers have to be manipulated with CPU architecture-dependent special instructions, and that, too, is only possible in machine language.
I don't quite understand how can an Operating System written in high level languages like c and c++ obtain this kind of memory management functionality.
As described above, depending on the architecture, this could be achieved by having special instructions to manage the MMU, TLB etc. INVLPG is one example of such an instruction in the x86 architecture. Note that having a special instruction requiring kernel privileges is probably the simplest way to implement such a feature in hardware in a secure manner, because then it is simply sufficient to check if the CPU is in kernel mode in order to determine whether the instruction can be executed or not.
Compilers turn high-level languages into asm / machine code for you, so you don't have to write asm yourself. You pick a compiler that handles memory the way you want your OS to; e.g. using the callstack for automatic storage, and not implicitly calling malloc / free (because those won't exist in your kernel).
To link your compiled C/C++ into a kernel, you typically have to know more about the ABI it targets, and the toolchain especially the linker.
The ISO C standard treats implementation details very much as a black box. But real compilers that people use for low level stuff work in well-known ways (i.e. make the expected/useful implementation choices) that kernel programmers depend on, in terms of compiling code and static data into contiguous blocks that can be linked into a single kernel executable that can be loaded all as one chunk.
As for actually managing the system's memory, you write code yourself to do that, with a bit of inline asm where necessary for special instructions like invlpg as other answers mention.
The entry point (where execution starts) will normally be written in pure asm, to set up a callstack with the stack pointer register pointing to it.
And set up virtual memory and so on so code is executable, data is read/write, and read-only data is readable. All of this before jumping to any compiled C code. The first C you jump to is probably more kernel init code, e.g. initializing data structures for an allocator to manage all the memory that isn't already in use by static code/data.
Creating a stack and mapping code/data into memory is the kind of setup that's normally done by an OS when starting a user-space program. The asm emitted by a compiler will assume that code, static data, and the stack are all there already.
I'm quite honestly sick of the Intel compiler now, because it's just buggy; it sometimes generates incorrect, crashing code, which is especially bad since compilation takes about 2 hours, so there's really no way to work around it. Profile-guided optimizations, which are needed to make the executables at least reasonably sized, currently always generate crashing code for me, so...
But it has one perk no other compiler I know of has: dispatching to instruction sets, which is essential for my use case, signal processing. Is there any other compiler that can do that?
(For the record, I'm even OK with "pragma-ing" every loop that needs the CPU dispatching, and there's probably no need for it on non-looped operations.)
I know that every program one writes has to eventually boil down to machine code - that's what compilers produce, that's what executable files consist of, and that's the only language that processors understand. I also know that different processors may have different instruction sets (I know 65c816 assembly, and I imagine it's vastly different from today's computers).
Here's what I'm not getting, though: If there exist different instruction sets, then why do we not seem to have to care about that every time we use software?
If a program was compiled for one particular CPU, it might not run on another - and yet, I never see notices like "Intel users download this version, AMD users download this one". I never have to even be aware of what CPU I'm on, every executable just seems to... work. The same goes for compilers, apparently - there isn't a separate version of, say, GCC, for every processor there is, right?
I'm aware that the differences in instruction sets are much more subtle than they used to be, but even then there should at least be a bit of a distinction. I'm wondering why there doesn't seem to be any.
What is it that I'm not understanding?
There actually are sometimes different versions for Intel/AMD, even for different versions of Intel and/or AMD. That's not common (especially in the kind of software people usually use) because it's not user friendly; most people don't even know what a CPU is or does, let alone what kind they have exactly. So what often happens is that either the multiple versions are all in the same executable and selected at runtime, or a "least common denominator" subset of instructions is used (when performance is not really a concern). There are significant differences between AMD and Intel, though; the most significant one is in which instruction sets they support. AMD always lags behind Intel in implementing Intel's new instructions (new sets come out regularly), and Intel usually does not implement AMD's extensions. AMD64 is the big exception (99% accepted by Intel, with small changes made), and a couple of instructions here and there were borrowed, but XOP will never happen even though it's awesome, and 3DNow! was never adopted by Intel. As an example of software that does not package the code for different "extended instruction sets" in the same executable, see y-cruncher.
To get back to the beginning, though: for some high-performance software (I can't name any off the top of my head, but I've seen it before) you may see different versions, each specifically tailored to get maximum performance on one specific microarchitecture. For example, P4 (NetBurst) and Core 2 are two very different beasts (that's mostly P4's fault for being crazy); even though Core 2 is backwards compatible and could run the same code unmodified, that is not necessarily a good idea from a performance perspective.
There are no separate Intel/AMD versions, because they use the same instruction set family: x86.
However, there are applications where you have to look out for different versions when you download them. There are still instruction sets that are quite different and might make a program act differently. For example, if you have a PowerPC architecture and code a network-based application on it, you can get away with forgetting the little-to-big-endian conversion, but if you compile the same code on x86, which is little endian, the application will most likely produce garbage on the network side.
There is also the difference in how many instructions there are, e.g. RISC vs CISC.
In the end there are a lot of differences to look out for, though in most programming languages you don't have to worry too much about them, as the compiler/interpreter will handle most things for you. If you work at a lower level, then you have to know what you're doing on each architecture.
Also if you compile for ARM, you won't be able to run the program on any other machine, like your PC with x86. It will not work at all.
That's because the opcodes differ. Take the mov instruction: on x86 the opcode is 0x88, while on ARM it might be 0x13, etc.
The distinction is in fact quite dramatic, except in the case of Intel vs. AMD: AMD makes their processors compatible with Intel machine code, on purpose of course.
Today there is a move to JIT compiling (Java, .NET, etc.). In this case, the executable file doesn't contain machine code; it contains a simple intermediate language that is compiled, just before it is executed, into the machine code of the running processor. This allows the processor architecture to be completely opaque.
AMD is an Intel clone, or vice versa depending on your view of the situation. Either way, there is so much in common that programs can be compiled to run on either (within reason; you can't go back 10 years, for example, or make a 32-bit processor understand 64-bit-specific instructions). The next step is that the motherboards have to do similar things, and they do. There may be some Intel- or AMD-specific chip-support items, but then you get into generic peripherals that can be found on either platform, or that are widespread enough on one or the other that the operating systems and/or applications support them.
Why does GCC on OS X 10.5 have the -fPIC option turned on by default? After all, doesn't it generate larger and slower code?
Unless your program has a lot of very small functions that all use global or static variables, or Objective-C, any performance decrease or size difference will be unnoticeable. PIC isn't used for automatic local variables, since they are already accessed through the stack. In functions which need it, the setup only requires four instructions, which isn't much compared to the code in the function. Each access using PIC is only one byte longer than an access without it, so again there isn't much difference.
If you are building for 64-bit, PIC will probably be smaller, and there will likely be no performance difference. The x86-64 architecture added a new instruction-pointer-relative addressing mode, which means there is no setup required for PIC. This addressing mode is actually one byte shorter than encoding the absolute address in the instruction, since the SIB byte isn't used.
Finally, using PIC makes your code more secure. If your code has to be loaded at the same place every time, then someone could find the location of important functions and data and cause problems at runtime. However, if the OS can choose to load your code at a different address, anyone trying to cause problems has to find out where the functions and data structures are every time the program is run.
The GCC 4.1.2 documentation has this to say about the -pipe option:
-pipe
Use pipes rather than temporary files for communication between the various stages of compilation. This fails to work on some systems where the assembler is unable to read from a pipe; but the GNU assembler has no trouble.
I assume I'd be able to tell from error message if my systems' assemblers didn't support pipes, so besides that issue, when does it matter whether I use that option? What factors should go into deciding to use it?
In our experience with a medium-sized project, adding -pipe made no discernible difference in build times. We ran into a couple of problems with it (sometimes failing to delete intermediate files if an error was encountered, IIRC), and so since it wasn't gaining us anything, we quit using it rather than trying to troubleshoot those problems.
It doesn't usually make any difference.
There are plus and minus considerations. Historically, running the compiler and assembler simultaneously would stress RAM resources.
GCC is small by today's standards, and -pipe adds a bit of multi-core-accessible parallel execution.
But by the same token, the CPU is so fast that it can create that temporary file and read it back without you even noticing. And since -pipe was never the default mode, it occasionally acts up a little. A single developer will generally report not noticing the time difference.
Now, there are some large projects out there. You can check out a single tree that will build all of Firefox, or NetBSD, or something like that, something that is really big. Something that includes all of X, say, as a minor subsystem component. You may or may not notice a difference when the job involves millions of lines of code in thousands and thousands of C files. As I'm sure you know, people normally work on only a small part of something like this at one time. But if you are a release engineer or working on a build server, or changing something in stdio.h, you may well want to build the whole system to see if you broke anything. And now, every drop of performance probably counts...
Trying this out now, it looks to be moderately faster to build when the source/build destinations are on NFS (Linux network). Memory usage is higher, though. If you never fill the RAM and have the source on NFS, -pipe seems like a win.
Honestly, there is very little reason not to use it. -pipe will only use a tad more RAM, of which, if this box is building code, I'd assume it has a decent amount. It can significantly improve build time if your system is using a more conservative filesystem that writes and then deletes all the temporary files along the way (ext3, for example).
One advantage of -pipe is that the compiler interacts less with the file system. Even when it is a RAM disk, the data still needs to go through the block I/O and file system layers when using temporary files, whereas with a pipe it is a bit more direct.
With files, the compiler first needs to finish writing before it can call the assembler. Another advantage of pipes is that both the compiler and the assembler can run at the same time, making a bit more use of SMP architectures. Especially when the compiler needs to wait for data from the source file (because of blocking I/O calls), the operating system can give the assembler full CPU time and let it do its job faster.
From a hardware point of view I guess you would use -pipe to preserve the lifetime of your hard drive.