Why do we need a linker script and startup code? - gcc

I've read this tutorial: http://www.bravegnu.org/gnu-eprog/c-startup.html
I could follow the guide and run the code, but I have questions.
1) Why do we need both a load address and a run-time address? As I understand it, it is because .data is stored in flash too; so why can't we run the app from there, instead of needing start-up code to copy it into RAM?
2) Why do we need a linker script and start-up code here? Can I not just build the C source as below and run it with qemu?
arm-none-eabi-gcc -nostdlib -o sum_array.elf sum_array.c
Many thanks

Your first question was answered in the guide.
When you load a program on an operating system, your .data section (basically the non-zero globals) is loaded from the "binary" into the right offset in memory for you, so that when your program starts, the memory locations that represent your variables hold those values.
unsigned int x=5;
unsigned int y;
As a C programmer you write the above code and you expect x to be 5 when you first start using it, yes? Well, if you are booting from flash, bare metal, you don't have an operating system to copy that value into RAM for you; somebody has to do it. Further, all of the .data stuff has to be in flash: that number 5 has to be somewhere in flash so that it can be copied to RAM. So you need a flash address for it and a RAM address for it. Two addresses for the same thing.
And that begins to answer your second question. For every line of C code you write, you assume things, for example that any function can call any other function. You would like to be able to call functions, yes? And you would like to be able to have local variables, and you would like the variable x above to be 5, and you might assume that y will be zero (although, thankfully, compilers are starting to warn about that). The startup code, at a minimum for generic C, sets up the stack pointer, which allows you to call other functions, have local variables, and have functions more than one or two lines of code long; it zeros the .bss so that the y variable above is zero; and it copies the value 5 over to RAM so that x is ready to go when your entry-point C function is run.
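To make the "somebody has to do it" part concrete, here is a minimal sketch in C of the two loops a startup file typically contains. The function and parameter names are made up for illustration; on a real target the pointers would come from symbols that your linker script defines, and setting the stack pointer is left out because that is normally done in assembly (or by the hardware on some cores).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy the initial values of .data from their load address (flash)
 * to their run-time address (RAM). All names are hypothetical; a real
 * startup file would get these addresses from linker-script symbols.
 */
void copy_data(const uint32_t *load, uint32_t *start, const uint32_t *end)
{
    assert(start <= end);
    while (start < end) {
        *start++ = *load++;
    }
}

/* Clear .bss so that uninitialised globals start out as zero. */
void zero_bss(uint32_t *start, const uint32_t *end)
{
    assert(start <= end);
    while (start < end) {
        *start++ = 0u;
    }
}
```

With x and y from above, copy_data() is what puts the 5 into x, and zero_bss() is what makes y zero, before your entry-point function runs.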
If you don't have an operating system then you have to have code to do this. And yes, there are many, many sandboxes and toolchains set up for various platforms that already include the startup code and linker script, so that you can just
gcc -o myprog.elf myprog.c
Now that doesn't mean you can make system calls (printf, fopen, etc.) without a... system. But if you download one of these toolchains, it does mean that you don't actually have to write the linker script or the bootstrap.
But it is still valuable information. Note that the startup code and linker script are required for operating-system-based programs too; it is just that native compilers for your operating system assume you are mostly going to write programs for that operating system, and as a result they provide a linker script and startup code in that toolchain.

1) The .data section contains variables. Variables are, well, variable -- they change at run time. The variables need to be in RAM so that they can be easily changed at run time. Flash, unlike RAM, is not easily changed at run time. The flash contains the initial values of the variables in the .data section. The startup code copies the .data section from flash to RAM to initialize the run-time variables in RAM.
2) Linker-script: The object code created by your compiler has not been located into the microcontroller's memory map. This is the job of the linker and that is why you need a linker script. The linker script is input to the linker and provides some instructions on the location and extent of the system's memory.
Startup code: Your C program that begins at main does not run in a vacuum but makes some assumptions about the environment. For example, it assumes that the initialized variables are already initialized before main executes. The startup code is necessary to put in place all the things that are assumed to be in place when main executes (i.e., the "run-time environment"). The stack pointer is another example of something that gets initialized in the startup code, before main executes. And if you are using C++ then the constructors of static objects are called from the startup code, before main executes.
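As a small illustration of "things that happen before main executes", here is a sketch that relies on the GCC/Clang constructor attribute (a compiler extension, not standard C). The variable and function names are invented for the example.

```c
#include <assert.h>

/* Set by code that runs before main(); the name is made up for this sketch. */
int ready_before_main = 0;

/*
 * GCC and Clang arrange for functions marked with this attribute to be
 * called from the startup code, before main() is entered.
 */
__attribute__((constructor))
static void early_init(void)
{
    ready_before_main = 1;
}

/* Something main() could call to confirm its environment was prepared. */
int environment_ready(void)
{
    assert(ready_before_main == 0 || ready_before_main == 1);
    return ready_before_main;
}
```

On a hosted system the toolchain's startup code runs early_init() for you. On bare metal it only runs if your own startup code walks the init array, which is one more job for that code.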

1) Why do we need both load-address and run-time address.
While it is in most cases possible to run code from memory-mapped ROM, code will often execute faster from RAM. In some cases there may also be much more RAM than ROM, and the application code may be compressed in ROM, so the executable code is not simply copied from ROM but also decompressed, allowing a much larger application than the available ROM would otherwise hold.
In situations where the code is stored on non-memory mapped mass-storage media such as NAND flash, it cannot be executed directly in any case and must be loaded into RAM by some sort of bootloader.
2) Why we need linker script and start-up code here. Can I not just build C source as below and run it with qemu?
The linker script defines the memory layout of your target and application. Since this tutorial is for bare-metal programming, there is no OS to handle that for you. Similarly, the start-up code is required to at least set an initial stack pointer, initialise static data, and jump to main. On an embedded system it is also necessary to initialise various hardware such as the PLL, memory controllers, etc.

Related

Can gcc be configured to compile position-independent code for the code but position-dependent code for the data?

I'm trying to build bootable code for an ARM M7-based embedded system that is able to execute in place at two different locations in the QSPI, so that if one version gets corrupted, the backup version of the image can be executed in a different place.
Compiling with -fpic seems to produce a relocatable code image that is (nearly) able to execute in both places fine. However, the problem is that the data/bss the code refers to is also getting offset by the same amount - that is, the compiler is assuming that the .data and .bss segments live immediately after the .text segment, which isn't true for XIP embedded systems (where the RAM is separate).
As a result, if the original binary was linked to run at 0x60000000 (and using a fixed RAM area at 0x20000000) but is then executed in place at 0x60100000 instead, the RAM addresses will be shifted by 0x100000 as well (i.e. to 0x20100000), which isn't what I want at all.
Clearly, what I'd like to do is to modify gcc's behaviour so that references to the code (executing in place in two different places in the QSPI) are position-independent, while references to the .data/bss segments (in a fixed position in RAM) are position-dependent (as per normal).
Is this something that gcc can be tweaked to achieve (e.g. by some obscure linker attribute flag)? Or is this just out of its reach? Thanks!

accessing process memory parts

I'm currently studying OS memory management through a video lecture. The instructor says,
In fact, you may have, and it is quite often the case that there may
be several parts of the process memory, which are not even accessed at
all. That is, they are neither executed, loaded or stored from memory.
I don't understand this statement, since even in a simple C program we access its whole address space. Don't we?
#include <stdio.h>
int main()
{
    printf("Hello, World!");
    return 0;
}
Could you explain this statement? If possible, could you provide an example program in which "several parts of the process memory" are "not even accessed at all" when it is run?
Imagine you have a large and complicated utility (e.g. a compiler), and the user asks it for help (e.g. they type gcc --help instead of asking it to compile anything). In this case, how much of the utility's code and data is used?
Most programs have various optional parts that aren't used (e.g. maybe something that works with graphics will have some code for 16 bits per pixel and other code for 32 bits per pixel, and will determine which code to use and not use the other code). Most heap allocators are "eager" (e.g. they'll ask the OS for 20 MiB of space and the program might then only malloc() 2 MiB of it). Sometimes a program will memory map a huge file but then only access a small part of it.
Even for your trivial "hello world" example code; the virtual address space probably contains a huge (several MiB) shared library to support lots of C standard library functions (e.g. puts(), fprintf(), sprintf(), ...) and your program only uses a small part of that shared library; and your program probably reserves a conservative amount of space for its stack (e.g. maybe 20 KiB of space for its stack) and then probably only uses a few hundred bytes of stack.
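A minimal sketch of the "reserve a lot, use a little" pattern described above. The function name and the sizes are made up for illustration, and whether the untouched part ever gets physical memory depends on your allocator and OS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Reserve `total` bytes but write to only the first `used` of them.
 * Returns the number of bytes actually written, or 0 if allocation fails.
 * Function and parameter names are made up for this sketch.
 */
size_t touch_prefix(size_t total, size_t used)
{
    assert(used <= total);
    unsigned char *buf = malloc(total);
    if (buf == NULL) {
        return 0;
    }
    for (size_t i = 0; i < used; i++) {
        buf[i] = (unsigned char)i;   /* only these bytes are ever accessed */
    }
    size_t touched = used;
    free(buf);
    return touched;
}
```

Calling touch_prefix(20u * 1024u * 1024u, 4096u) puts 20 MiB into the process's address space while the program only ever stores to the first 4 KiB. On a typical demand-paged system I would expect the rest never to be backed by physical memory at all.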
In a virtual memory system, the address space of the process is created in secondary store at start up. Little or nothing gets placed in memory. For example, the operating system may use the executable file as the page file for the code and static data. It just sets up an internal structure that says some range of memory is mapped to these blocks in the executable file. The same goes for shared libraries. The other data gets mapped to the page file.
As your program runs it starts page faulting rapidly because nothing is in memory and the operating system has to load it from secondary storage.
If there is something that your program does not reference, it never gets loaded into memory.
If you had global variable declared like
char somedata [1045] ;
and your program never references that variable, it will never get loaded into memory. The same goes for code. If you have pages of code that don't get executed (e.g. error handling code), they do not get loaded. If you link to shared libraries, you will likely be including a lot of functions that you never use. Likewise, they will not get loaded if you do not execute them.
To begin with, not all of the address space is backed by physical memory at all times, especially if your address space covers 2^48+ bytes, which your computer doesn't have (which is not to say you can't map most of the address space to a single physical page of memory, which would be of very little utility for anything).
And then some portions of the address space may be purposefully permanently inaccessible, like a few pages near virtual address 0 (to catch NULL pointer dereferences).
And as has been pointed out in the other answers, with on-demand loading of programs you may have some portions of the address space reserved for your program, but if the program doesn't happen to need any of its code or data there, nothing needs to be loaded there either.

Why ELF executables have a fixed load address?

ELF executables have a fixed load address (0x08048000 on 32-bit x86 Linux binaries, and 0x400000 on 64-bit x86_64 binaries).
I read the SO answers (e.g., this one) about the historical reasons for those specific addresses. What I still don't understand is why to use a fixed load address and not a randomized one (given some range to be randomized within)?
why to use a fixed load address and not a randomized one
Traditionally that's how executables worked. If you want a randomized load address, build a PIE binary (which is really a special case of shared library that has startup code in it) with -fPIE and link with -pie flags.
Building with -fPIE introduces runtime overhead, in some cases as bad as 10% performance degradation, which may not be tolerable if you have a large cluster or you need every last bit of performance.
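If you want to see the difference for yourself, a small sketch like the following can help. The function names are made up; the idea is simply to print the run-time address of a function and compare it across runs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* A function whose address we inspect; the name is made up for this sketch. */
void marker(void)
{
}

/* Return the run-time address of marker() as an integer. */
uintptr_t marker_address(void)
{
    /* Converting a function pointer to an integer is implementation-defined,
       but works on the common GCC/Clang targets. */
    uintptr_t addr = (uintptr_t)&marker;
    assert(addr != 0);
    return addr;
}

/* Print the address; call this from main() and run the program twice. */
void print_marker_address(void)
{
    printf("marker() is at %#lx\n", (unsigned long)marker_address());
}
```

Built as a traditional non-PIE executable (for example with -no-pie), I would expect the same address on every run. Built with -fPIE and linked with -pie on a system with ASLR enabled, I would expect it to change from run to run.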
Not sure if I understood your question correctly, but assuming I did, this is sort of a "legacy" / historical issue. ELF is the file format used by many Unix-derived and Unix-like operating systems (Linux, for example).
For a plain executable, the ELF format records a resolved, absolute virtual address that the code is loaded at and begins running from.
That is simply how the file format works, and for historical reasons it can't easily be changed. You couldn't just "throw" such an executable at any memory address and have it run successfully. Back in the 90's, when the ELF format was introduced, problems such as calling functions through virtual tables were raised, and it was decided that ELF executables would have absolute addresses within them.
Also think about it; take a look at the ELF format: https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
How would you design an OS executable loader that could load an executable at ANY desired virtual address and have the code run successfully without actually changing the binary itself? To do something like that you'd need to change either the output that compilers generate or the format itself, which again isn't practical.
As time passed, the need for position-independent code and executables (PIC/PIE) arose, and shared objects were introduced to allow that, together with ASLR (Address Space Layout Randomization). This means that the code can be placed at any memory address and still execute. It is implemented by making sure that all calls within the code itself are relative to the address of the currently executing instruction, AND by having the OS loader, when the shared object is loaded, change some data within the binary. The data that is changed is not executable instructions (R E) but actual data (RW, e.g. the .data segment), and calls to external functions go through "jump tables" that are filled in at load time, for example the PLT / GOT. Shared objects therefore allow the addresses the code is loaded at to be randomized, and if you want to run more "secure" code you have to compile it as a shared object and dynamically link it at load time or run time.
(Hope I've cleared some things up :) )
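The jump-table idea can be mimicked in plain C. This is only a toy model with invented names, not how a real dynamic loader is written: the "loader" patches a writable table of function pointers, and the program code calls through that table instead of hard-coding the callee's address.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*func_ptr)(int);

/* Stand-in for a function living in a shared library. */
int lib_double(int x)
{
    return 2 * x;
}

/*
 * Stand-in for a GOT-like table: writable data, one slot per imported
 * function. All names here are hypothetical.
 */
func_ptr jump_table[1] = { NULL };

/* What a loader conceptually does: patch data, not instructions. */
void fake_loader_resolve(void)
{
    jump_table[0] = lib_double;
}

/* Program code never hard-codes the callee's address; it goes via the table. */
int call_imported(int x)
{
    assert(jump_table[0] != NULL);
    return jump_table[0](x);
}
```

Because only the table changes, the instructions in call_imported() stay identical no matter where lib_double() ends up in memory.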

STM32F4 running FreeRTOS in external RAM

We have a thesis project at work where the guys are trying to get external RAM to work for the STM32F417 MCU.
The project is trying out some stuff that is really resource hungry and the internal RAM just isn't enough.
The question is how to best do this.
The current approach has been to just replace the RAM address in the link script (gnu ld) with the address for external RAM.
The problem with that approach is that during initialisation, the chip has to run on internal RAM since the FSMC has not been initialized.
It seems to work, but as soon as pvPortMalloc is run we get a hard fault, probably due to dereferencing bogus addresses. We can see that variables are not initialized correctly at system init (which makes sense, I guess, since the internal RAM is not used at all when it probably should be).
I realize that this is a vague question, but what is the general approach when running code in external RAM on a Cortex M4 MCU, more specifically the STM32F4?
Thanks
FreeRTOS defines and uses a single big memory area for stack and heap management; this is simply an array of bytes, the size of which is specified by the configTOTAL_HEAP_SIZE symbol in FreeRTOSConfig.h. FreeRTOS allocates task stacks in this memory area using its pvPortMalloc function, therefore the main goal here is to place the FreeRTOS heap area into external SRAM.
The FreeRTOS heap memory area is defined in heap_*.c (with the exception of heap_3.c that uses the standard library malloc and it doesn't define any custom heap area), the variable is called ucHeap. You can use your compiler extensions to set its section. For GCC, that would be something like:
static uint8_t ucHeap[ configTOTAL_HEAP_SIZE ] __attribute__ ((section (".sram_data")));
Now we need to configure the linker script to place this custom section into external SRAM. There are several ways to do this and it depends again on the toolchain you're using. With GCC one way to do this would be to define a memory region for the SRAM and a section for ".sram_data" to append to the SRAM region, something like:
MEMORY
{
    ...
    /* Define SRAM region */
    sram : ORIGIN = <SRAM_START_ADDR>, LENGTH = <SRAM_SIZE>
}
SECTIONS
{
    ...
    /* Define .sram_data section and place it in sram region */
    .sram_data :
    {
        *(.sram_data)
    } >sram
    ...
}
This will place the ucHeap area in external SRAM, while all the other text and data sections will be placed in the default memory regions (internal flash and ram).
A few notes:
- Make sure you initialize the SRAM controller/FSMC prior to calling any FreeRTOS function (like xTaskCreate).
- Once you start the tasks, all stack-allocated variables will be placed in ucHeap (i.e. ext RAM), but global variables are still allocated in internal RAM. If you still have internal RAM size issues, you can configure other global variables to be placed in the ".sram_data" section using compiler extensions (as shown for ucHeap).
- If your code uses dynamic memory allocation, make sure you use pvPortMalloc/vPortFree instead of the stdlib malloc/free. This is because only pvPortMalloc/vPortFree will use the ucHeap area in ext RAM (and they are thread-safe, which is a plus).
- If you're doing a lot of dynamic task creation/deletion and memory allocation with pvPortMalloc/vPortFree with different memory block sizes, consider using heap_4.c instead of heap_2.c. heap_2.c has memory fragmentation problems when using several different block sizes, whereas heap_4.c is able to combine adjacent free memory blocks into a single large block.
Another (and possibly simpler) solution would be to define the ucHeap variable as a pointer instead of an array, like this:
static uint8_t * const ucHeap = (uint8_t *) <SRAM_START_ADDR>;
This wouldn't require any special linker script editing; everything can be placed in the default sections. Note that with this solution the linker won't explicitly reserve any memory for the heap, and you will lose some potentially useful information/errors (like the heap area not fitting in ext RAM). But as long as you only have ucHeap in external RAM and configTOTAL_HEAP_SIZE is smaller than the external RAM size, that might work just fine.
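If you go the pointer route, one way to get back part of the lost "does it fit" check is a compile-time assertion. This is a sketch under stated assumptions: EXT_SRAM_BASE and EXT_SRAM_SIZE are names I made up, and the values (including 0x60000000 as the external memory base) are placeholders that you should replace with the real ones from your board's memory map and FreeRTOSConfig.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; take the real ones from your
   board's memory map and FreeRTOSConfig.h. */
#define EXT_SRAM_BASE         0x60000000u
#define EXT_SRAM_SIZE         (1024u * 1024u)
#define configTOTAL_HEAP_SIZE (512u * 1024u)

/* Compile-time stand-in for the linker's "region overflowed" error. */
static_assert(configTOTAL_HEAP_SIZE <= EXT_SRAM_SIZE,
              "FreeRTOS heap does not fit in external SRAM");

/* The heap is just a fixed address; nothing is reserved by the linker. */
static uint8_t * const ucHeap = (uint8_t *)(uintptr_t)EXT_SRAM_BASE;

/* Helper for this sketch: first address past the end of the heap. */
uintptr_t heap_end_address(void)
{
    return (uintptr_t)ucHeap + configTOTAL_HEAP_SIZE;
}
```

This only checks the heap against the total external RAM size. It cannot notice anything else you place in external RAM by hand, which is the kind of overlap the linker-script approach would catch.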
When the application starts up it will try to initialise data by either clearing it to zero, or initialising it to a non-zero value, depending on the section the variable is placed in. Using a normal run time model, that will happen before main() is called. So you have something like:
1) Reset vector calls init code
2) C run time init code initialises variables
3) C run time init code calls main()
If you use the linker to place variables in external RAM then you need to ensure the RAM is accessible before that initialisation takes place, otherwise you will get a hard fault. Therefore you need to either have a boot loader that sets up the system for you, then starts your application....or more simply just edit the start up code to do the following:
1) Reset vector calls init code
2) >>>C run time init code configures external RAM<<<
3) C run time init code initialises variables
4) C run time init code calls main().
That way the RAM is available before you try to access it.
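The modified ordering can be sketched in C as below. It is a host-side stand-in with invented names: each hook just records that it ran, whereas on a real STM32F4 ext_ram_init() would program the FSMC and runtime_init_data() would do the .data copy and .bss clear.

```c
#include <assert.h>
#include <stddef.h>

/* Records the order in which the startup steps ran (host-side stand-in). */
enum step { STEP_EXT_RAM = 1, STEP_INIT_DATA = 2, STEP_MAIN = 3 };
enum step boot_log[3];
size_t boot_log_len = 0;

static void record(enum step s)
{
    assert(boot_log_len < 3);
    boot_log[boot_log_len++] = s;
}

/* Hypothetical hooks; on a real STM32F4 the first would program the FSMC. */
static void ext_ram_init(void)      { record(STEP_EXT_RAM); }
static void runtime_init_data(void) { record(STEP_INIT_DATA); }
static void app_main(void)          { record(STEP_MAIN); }

/* Reset handler: external RAM must be usable before .data/.bss are touched. */
void reset_handler(void)
{
    ext_ram_init();
    runtime_init_data();
    app_main();
}
```

One thing to keep in mind: because ext_ram_init() runs before the variables are initialised, it should not rely on any initialised globals itself.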
However, if all you want to do is have the FreeRTOS heap in external RAM, then you can leave the init code untouched and just use an appropriate heap implementation, basically one that does not just declare a large static array. For example, if you use heap_5 then all you need to do is ensure the heap init function (vPortDefineHeapRegions()) is called before any allocation is performed, because the heap init just describes which RAM to use as the heap rather than statically declaring the heap.

What is a TRAMPOLINE_ADDR for ARM and ARM64(aarch64)?

I am writing a basic check-pointing mechanism for ARM64 using ptrace. To do so I am using some code from cryopid, and I found a TRAMPOLINE_ADDR macro like the following:
#define TRAMPOLINE_ADDR 0x00800000 /* 8MB mark, for x86 */
#define TRAMPOLINE_ADDR 0x00300000 /* 3MB mark, for x86_64 */
When I read about trampolines, they seem to be something related to jump statements. But my question is: where do the above values come from, and what would the corresponding values be for the ARM and ARM64 platforms?
Thank you
Just read the Wikipedia page.
There is nothing magic about a trampoline, and certainly not about a particular address; any address where you can have code that executes can hold a trampoline. There are many use cases for them, for example:
Say you are booting off of a flash, a SPI flash, running at some safe rate so that the chip boots for all users. But you want to increase the rate of the SPI flash, and the SPI peripheral does not allow you to change it while executing code from it. So you copy some code to RAM, that code boosts the SPI flash to a faster rate so you can use and/or run from the flash faster, and then you bounce back to running from the flash. You have bounced, or trampolined, off of that little bit of code in RAM.
You have a chip that boots from flash but has the ability to re-map that address space to RAM, for example. So you copy some code to some other RAM and branch to it; that little bit of trampoline code remaps the address space, then bounces you back, or bounces you to wherever the flash is now mapped.
You will see the GNU linker sometimes add a small trampoline. Say you compile some modules as Thumb and some others as ARM; you no longer have to use that interwork thing, the linker takes care of cleaning this up. It may add an instruction or two to trampoline you between modes. Sometimes it modifies the code to just go where it needs to; sometimes it modifies the code to branch-link somewhere close, and that somewhere close is a trampoline.
I assume there may be a need to do the same thing for aarch64 if/when switching to that mode.
So there should be no magic. Your specific application might have one or many trampolines, and the one you are interested in might not even be called that, but it is probably application specific. There is absolutely no reason why there would be one address for everyone, unless it is some very rigid operating-system-specific (again, "application specific") thing and one specific trampoline for that operating system is at some #defined address.
