Why is the _text address in Linux ARM64 boot so high? - linux-kernel

I'm currently learning about the ARM64 architecture, and I have trouble understanding how Linux can boot. From the linker script, we can see that:
SECTIONS {
    . = PAGE_OFFSET + TEXT_OFFSET;
    .head.text : {
        _text = .;
        HEAD_TEXT
    }
the initial _text symbol is very high, something like 0xffffffc000080000. To my knowledge, this address is a physical address, but that seems impossible on an 8 GB RAM board. How is this possible? Am I missing something?
EDIT: The Linux documentation states that:
- Caches, MMUs
The MMU must be off.
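For reference, a sketch of where a number like that comes from, paraphrased from arch/arm64/include/asm/memory.h of that era (the exact expression varies by kernel version); the point is that _text is a link-time virtual address, not a physical one:

#define VA_BITS      39
/* PAGE_OFFSET is the start of the kernel's virtual address space, built
   purely from VA_BITS -- nothing about it refers to physical RAM. */
#define PAGE_OFFSET  (0xffffffffffffffffUL << (VA_BITS - 1))
/* => 0xffffffc000000000; with the default TEXT_OFFSET of 0x80000 the link
   address of _text comes out as 0xffffffc000080000. */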

Related

Explicitly set starting stack pointer with linker script

I'd like to create a program with a special section at the end of virtual memory. So I wanted to write a linker script something like this:
/* ... */
.section_x 0xffff0000 : {
    _start_section_x = .;
    . = . + 0xffff;
    _end_section_x = .;
}
The problem is that gcc/ld/glibc seem to place the stack at this location by default for a 32-bit application, even if it overlaps a known section. The above code zeroes out the stack, causing an exception. Is there any way to tell the linker to use another virtual memory location for the stack? (As well, I'd like to ensure the heap doesn't span this section of virtual memory...)
I hate answers that presume or even ask if the question is wrong, but if you need a 64k segment, why can't you just allocate one at startup?
Why could you possibly need a fixed address within your process address space? I've been doing a lot of different kinds of coding for almost 30 years, and I haven't seen the need for a fixed address since the advent of protected memory.
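To make the commenters' suggestion concrete, here is a minimal sketch of my own (assuming Linux; MAP_FIXED_NOREPLACE needs kernel 4.17+, and on many 32-bit setups an address that high belongs to the kernel, so the call may simply fail, which is exactly why it is checked) that reserves the 64k block at startup instead of pinning it in the linker script:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define SECTION_X_ADDR ((void *)0xffff0000u)   /* placeholder address */
#define SECTION_X_SIZE 0x10000u                /* 64k */

int main(void)
{
    /* Ask for the fixed address but refuse to clobber anything already
       mapped there (e.g. the stack); fails cleanly instead of faulting. */
    void *p = mmap(SECTION_X_ADDR, SECTION_X_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("section_x reserved at %p\n", p);
    return 0;
}

If the address does not truly have to be fixed, passing NULL instead of a hint sidesteps the stack/heap collision entirely.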

How to port a project from one STM32Fx device to another in the same series

I'm a little confused about the language/lingo/terminology of the ST line of MCUs in general, and I think this is stopping me from progressing.
A little background: I'm an EE who learned all I know about FW through compulsory college courses that used the AVR platform. Loved it, very simple and easy to use.
A quick look through the one and only datasheet, and bang, you're abstracting away! Macros, pound defines, etc... It was so simple! You write a main.c and a Makefile, compile, and fire away with avrdude... life is good.
Then I walked into the thresher of the STM32 ARM core, and holy smokes... Standard Peripheral Libraries, HAL layer, CubeMX, assembly startup files, CMSIS, IDE-specific project files, and a dozen different variants of just the STM32F4... I was quickly overwhelmed!
My real problem has been that I've got this STM32F411 Discovery board, and I need to get a 20x4 character LCD running. I found a couple of decent-looking projects on GitHub and the like:
https://stm32f4-discovery.net/2014/06/library-16-interfacing-hd44780-lcd-controller-with-stm32f4/
https://github.com/EarToEarOak/STM32F4-HD44780
https://github.com/6thimage/hd44780
https://github.com/mirkoggl/STM32F4_LiquidCrystal
But I cannot figure out how to get any of these to compile correctly.
I've been using CubeMX to generate a Makefile as a starting point, mostly because it produces the linker script and the startup .s file that I apparently need but have absolutely no experience with or idea how to write on my own. That's my addiction to CubeMX.
I think most of my issues come from conflicts between the MCU-specific code generated by CubeMX and the various online project code I'm trying to glue together... either the online projects include some .h file or reference something that the Makefile can't find, and that's where things go off the rails.
For example, some of these projects include the stm32f4xx.h file, but CubeMX doesn't put that in its generated code.
I had to go elsewhere in the system and copy that file into the ./Inc folder, but that seems really backwards... and it still didn't build.
If I'm going to keep my head above water here, I need an AVR-to-STM32 12-step rehab program or something...
So, the big question is:
How would you guys take one of these projects from the first git clone xxxx and build it to run on an STM32F411 Discovery board using arm-none-eabi and Makefiles only... no IDEs allowed.
And what are the most important differences to understand between developing for the 8-bit AVRs and the STM32F4?
Given my limited experience and the number of novice blunders I'm making, I think the safest path forward is to make a project with CubeMX so that everything is at least set up right, and then carefully add the source files in by hand. But that still doesn't resolve what I'm fundamentally misunderstanding about what a sane workflow with these devices should look like if I'm using Unix as my IDE.
Thanks!
So, a very broad question. First and foremost, what exactly was your AVR experience? Clearly, if you started at main.c, then someone else built your sandbox/tool environment for you, including the bootstrap. The AVR is more Harvard-ish than ARM, so it is actually more of a PITA to truly build for without someone doing that work for you.
There is no reason why the experience cannot be exactly the same. You can take an AVR document, read about the registers for some peripheral, poke at the registers for that peripheral in your program, and make it work. You can take the ST document, read about the registers for some peripheral, poke at the registers for that peripheral in your program, and make it work.
You can take some library like Arduino for your AVR, read about the API calls, write a program that makes some calls, and program the device. You can take some library for your ST chip and do the same thing; the API calls won't be the same. Arduino libraries are not the same API calls as other AVR libraries made by Atmel or some other party. You could jump on mbed.org and start writing API calls for some ST chip, and poof, a working binary.
All MCU chip vendors need to provide libraries, as not everyone is willing or (so they think) able to bare-metal their way through (not that the libraries are not bare metal; they are just API calls, so it feels more system-like); the vendors wouldn't survive in this day and age otherwise. Likewise, you have to have newer and better with some new name, so the libraries change. ARM is wonderful but at the same time a royal PITA, because they make cores, not chips, and their cores are surrounded by different stuff. There are a ton of Cortex-M4 solutions that use the same Cortex-M4 core, but you can't write one Cortex-M4 program that works on all of them, because the chip-vendor-specific stuff is all different (sure, a massive if-then-else program would work if you could get it to fit). ARM is trying to make a magic layer that looks the same-ish, and the vendors are being dragged along, so CMSIS and mbed are ARM-driven notions to solve this problem. AVR doesn't have this problem: the core and the chip vendor are one and the same. Now, there are a number of different AVR cores, and even if you had the same peripheral you might not be able to write one binary that works across all of them, and/or there might be binaries for one (xmega) that don't work on another (tiny) because of differences in the AVR core implementation even if the peripherals were the same. The AVR problem is much smaller than the ARM problem, though, and it is all contained within one company, so the peripherals, if they vary, are not going to vary nearly as much as Atmel vs NXP vs ST vs TI vs others using the same ARM cores (or at least the same-named cores; there are items in the ARM source that are easy to modify, with or without floating point, with 16-bit or 32-bit fetches, etc., documented in the TRM for that core, creating more pain).
Within a company like ST, across just the Cortex-Ms, they have created a number of different peripherals: timers, GPIO, UART, PLL/clock, etc. If you take the bare-metal, talk-to-the-registers, no-other-libraries approach, you will see that in the transition from the Cortex-M3 to the Cortex-M0 they started using a different GPIO peripheral, but maybe some of the others stayed the same. Fast forward to today: we have Cortex-M3, Cortex-M0, Cortex-M0+, Cortex-M4, and Cortex-M7 based devices just from ST, and there are peripherals that mix and match; one product line may have a timer similar to an early Cortex-M3 product but a GPIO that is more like the first Cortex-M0 product. And they seem to mix and match each new family they create from a pool of peripherals. So I might have code for a specific chip that works great on some other specific chip, same family or maybe even a little different. But move that to another ST STM32 family and maybe the UART code works but the GPIO doesn't; take that first program and move it to yet another family and maybe the GPIO works and the UART doesn't. Once you have a library of your own for each of the different peripherals, you can mix and match them. Their libraries attempt to do that, and ideally use a common-ish call interface so that the code ports a little better, but you have to build for the different chip to get the different stuff linked in. Not perfect.
Also look at how old the ATmega328P is (very popular thanks to Arduino, and perhaps AVR Freaks before that); that thing is relatively ancient. In the time since it came out, all the STM32s were created, and for various reasons (size/speed/internal politics/etc.) different peripheral choices were made. All the above-mentioned Cortex-M variations, with different target use cases within the MCU world, were created in the time that the ATmega328P didn't change.
So if you take one AVR document and one AVR, and have a somewhat working toolchain, you can write programs straight from the AVR docs. If you take one STM32 document for one STM32 chip, take just about any of the gcc cross compilers for ARM (arm-none-eabi, arm-linux-gnueabi, etc.), and do your own bootstrap and linker script, which are pretty trivial for the Cortex-M, you can write programs straight from the STM32 docs and the ARM docs, no problem; both are somewhat well written. Repeat for a different STM32 chip, repeat for a Cortex-M based NXP chip, repeat for a Cortex-M based TI chip.
As a programmer, though, if you want to write one program for two chips, you need to look at the two chips, see what is common and what is different, and design your program to either avoid the differences, if-then-else around them, or use a link-time if-then-else solution.
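A trivial sketch of the compile-time flavor of that (the BOARD_* macros are made up for illustration; the base addresses are from the respective reference manuals):

#if defined(BOARD_STM32F411)
#define LED_GPIO_BASE 0x40020000    /* GPIOA on the F4 (AHB1 bus) */
#elif defined(BOARD_STM32F103)
#define LED_GPIO_BASE 0x40010800    /* GPIOA on the F1 (APB2 bus) */
#else
#error "pick a board"
#endif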
I find the libraries from the chip vendors harder to use than just reading the docs and programming the registers. YMMV. I recommend you continue to attempt to use their old and their new libraries, and also try going directly to the registers, and find what fits you best. The ARMs will run circles around the AVRs at similar prices, power, etc., so there is value in attempting to use these other parts. AVR, MSP430, etc. have become also-rans to Cortex-M based products, so professionally you should dig into one or more.
So, for example, a simple LED blinker for the STM32F411 with an LED on PA5:
flash.s
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang
.thumb_func
reset:
bl notmain
b hang
.thumb_func
hang: b .
.align
.thumb_func
.globl PUT16
PUT16:
strh r1,[r0]
bx lr
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
.thumb_func
.globl GET32
GET32:
ldr r0,[r0]
bx lr
.thumb_func
.globl dummy
dummy:
bx lr
.end
notmain.c
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
void dummy ( unsigned int );
#define RCCBASE 0x40023800
#define RCC_AHB1ENR (RCCBASE+0x30)
#define GPIOABASE 0x40020000
#define GPIOA_MODER (GPIOABASE+0x00)
#define GPIOA_OTYPER (GPIOABASE+0x04)
#define GPIOA_OSPEEDR (GPIOABASE+0x08)
#define GPIOA_PUPDR (GPIOABASE+0x0C)
#define GPIOA_BSRR (GPIOABASE+0x18)
#define STK_CSR 0xE000E010
#define STK_RVR 0xE000E014
#define STK_CVR 0xE000E018
static void led_init ( void )
{
    unsigned int ra;
    ra=GET32(RCC_AHB1ENR);
    ra|=1<<0; //enable GPIOA clock
    PUT32(RCC_AHB1ENR,ra);
    ra=GET32(GPIOA_MODER);
    ra&=~(3<<10); //PA5
    ra|=1<<10; //PA5 general purpose output (mode 01)
    PUT32(GPIOA_MODER,ra);
    ra=GET32(GPIOA_OTYPER);
    ra&=~(1<<5); //PA5 push-pull
    PUT32(GPIOA_OTYPER,ra);
    ra=GET32(GPIOA_OSPEEDR);
    ra|=3<<10; //PA5 high speed
    PUT32(GPIOA_OSPEEDR,ra);
    //pupdr
    ra=GET32(GPIOA_PUPDR);
    ra&=~(3<<10); //PA5 no pull-up/pull-down
    PUT32(GPIOA_PUPDR,ra);
}
static void led_on ( void )
{
    PUT32(GPIOA_BSRR,((1<<5)<<0));  //set PA5
}
static void led_off ( void )
{
    PUT32(GPIOA_BSRR,((1<<5)<<16)); //reset PA5 (upper half of BSRR)
}
void do_delay ( unsigned int sec )
{
    unsigned int ra,rb,rc,rd;
    //count SysTick ticks; 16,000,000 is roughly one second at the 16 MHz reset clock
    rb=GET32(STK_CVR);
    for(rd=0;rd<sec;)
    {
        ra=GET32(STK_CVR);
        rc=(rb-ra)&0x00FFFFFF;
        if(rc>=16000000)
        {
            rb=ra;
            rd++;
        }
    }
}
int notmain ( void )
{
    unsigned int rx;
    led_init();
    PUT32(STK_CSR,0x00000004); //SysTick: processor clock, not yet enabled
    PUT32(STK_RVR,0xFFFFFFFF); //max reload (only the low 24 bits are used)
    PUT32(STK_CSR,0x00000005); //enable SysTick on the processor clock
    for(rx=0;rx<5;rx++)
    {
        led_on();
        while(1) if(GET32(STK_CVR)&0x200000) break;      //watch bit 21 of the down counter
        led_off();
        while(1) if((GET32(STK_CVR)&0x200000)==0) break;
    }
    while(1)
    {
        led_on();
        do_delay(10);
        led_off();
        do_delay(1);
    }
    return(0);
}
flash.ld
MEMORY
{
    rom : ORIGIN = 0x08000000, LENGTH = 0x1000
    ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > rom
    .rodata : { *(.rodata*) } > rom
    .bss : { *(.bss*) } > ram
}
Then compile:
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m4 flash.s -o flash.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-m4 -c notmain.c -o notmain.o
arm-none-eabi-ld -o notmain.elf -T flash.ld flash.o notmain.o
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy notmain.elf notmain.bin -O binary
You can replace the prefix with arm-whatever-linux-gnueabi and it will still build and run.
There are some toolchains that add extra stuff when they see main(), so historically I personally avoid that name. I don't use .data on MCUs that boot from flash, so I don't have to copy it over (it burns .text, yes), and I don't assume .bss is zero; I initialize things before I use them. So I can cheat on my bootstrap, but there are many examples of slightly more complicated linker scripts that give you the .bss offset and size and the .data offset, size, and destination, to take this to the next level if you wish to have your C more pure. I also prefer to control the instruction used for loads and stores; some/many hope the compiler chooses correctly, and I have seen that fail. YMMV. So there is a lot of personal style here, but take your STM32F411 document and look at those addresses and registers, and even if you completely hate my code or style, you should still see how easy it was to use that peripheral. Likewise, the timers are in the ARM docs. These days some vendors, ST being one of them, have their own versions of the ARM docs in a modified form, so much of the ARM info is covered, but there are still some gaps.
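For reference, one common shape of that "slightly more complicated" linker script and its matching C bootstrap helper; this is a sketch of my own (the symbol names are arbitrary), not part of the example above:

flash.ld (sketch)
MEMORY
{
    rom : ORIGIN = 0x08000000, LENGTH = 0x1000
    ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > rom
    .rodata : { *(.rodata*) } > rom
    .data : {
        __data_start__ = .;
        *(.data*)
        __data_end__ = .;
    } > ram AT > rom
    __data_rom_start__ = LOADADDR(.data);
    .bss : {
        __bss_start__ = .;
        *(.bss*)
        __bss_end__ = .;
    } > ram
}

crt0.c (sketch)
/* Called from the reset handler before notmain(): copy .data from flash to
   RAM and zero .bss, using the symbols exported by the linker script. */
extern unsigned int __data_rom_start__, __data_start__, __data_end__;
extern unsigned int __bss_start__, __bss_end__;

void crt0 ( void )
{
    unsigned int *src = &__data_rom_start__;
    unsigned int *dst;
    for(dst = &__data_start__; dst < &__data_end__; ) *dst++ = *src++;
    for(dst = &__bss_start__;  dst < &__bss_end__;  ) *dst++ = 0;
}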
As a general rule for ARM: figure out from the chip vendor's documentation which ARM core you have in your chip. Then go to ARM's site and find the Technical Reference Manual (TRM) for that core (Cortex-M4 in this case). Then, in that document, ARM mentions that it is the ARMv7-M architecture, so get the Architecture Reference Manual for ARMv7-M. These three documents are your primary sources of information; after your first Cortex-M4 you probably only need the chip vendor docs 99% of the time, certainly within a chip vendor. Also find the CPUID register, or chip ID, or whatever that doc calls it, and compare it to what you read out of the chip/ARM core. Sometimes there are different versions of the ARM core (r1p3 meaning revision 1.3), and, rare but it happens, a change between revisions means that using the newest doc against an older core can result in subtle differences. Again, ARM and ARM-based chips are improving/changing way faster than Atmel AVR based ones. After the first or second you get the hang of it...
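A quick, hedged example of that check, reusing the GET32 helper from the code above: on Cortex-M parts the CPUID register sits in the System Control Block at 0xE000ED00 and encodes the rNpM revision:

unsigned int GET32 ( unsigned int );
#define SCB_CPUID 0xE000ED00
static unsigned int core_revision ( void )
{
    unsigned int id = GET32(SCB_CPUID);   //e.g. 0x410FC241 = Cortex-M4 r0p1
    unsigned int r  = (id >> 20) & 0xF;   //variant: the "r" in rNpM
    unsigned int p  = (id >>  0) & 0xF;   //revision: the "p" in rNpM
    return (r << 4) | p;                  //compare against the TRM you downloaded
}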
If you were going to use a PIC32, for example, it would be a similar story: Microchip for the PIC32 docs, then off to MIPS for the core docs (and then wishing that Microchip documented their peripherals even half as well as Atmel (which they now own), TI, ST, NXP, etc.). Another case of buying a processor core and wrapping your own stuff around it. The PIC32s are painful to program this way; you really need the libraries buried in the Microchip toolchain, and they use a ton more power, etc. There is no reason why MIPS-based parts shouldn't be able to compete with ARM-based ones, but they don't... MIPS, the chip vendors, or a combination are at fault there.

Why does do_anonymous_page() call pte_unmap() just after setting a new PTE entry? (Linux kernel 2.6.11 memory management)

I am a newbie to the Linux kernel. Today I have a question about some Linux kernel 2.6.11 memory management code (please check my code comments for my question) in do_anonymous_page(); the code slice follows:
if (write_access) {
    pte_unmap(page_table);
    spin_unlock(&mm->page_table_lock);
    page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
    spin_lock(&mm->page_table_lock);
    page_table = pte_offset_map(pmd, addr);
    mm->rss++;
    entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
                    vma->vm_page_prot)), vma);
    lru_cache_add_active(page);
    SetPageReferenced(page);
    set_pte(page_table, entry); /* here we just set the new pte entry */
    pte_unmap(page_table);      /* why unmap the PTE we just mapped and set?? */
    spin_unlock(&mm->page_table_lock);
    return VM_FAULT_MINOR;
}
If you read how the page_table was populated in the first place you will see it was pte_offset_map-ed first. It should be no surprise that there is a matching pte_unmap.
The page_table thingy IS NOT the pte thingy which is set here.
Rather, on certain architectures the kernel has a very limited address space. For instance, i386 is able to address 4 GB of memory, typically split into 3 GB for userspace and 1 GB for the kernel. But all kernel memory typically does not fit into 1 GB, so the problem is combated by temporarily mapping and unmapping various pages as needed. Page tables of userspace processes, as can be seen, are subject to this behaviour. These macros don't map/unmap anything on amd64, which has a big enough address space for the kernel to permanently map all physical memory.
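For reference, this is roughly what the 2.6-era i386 macros look like with CONFIG_HIGHPTE (paraphrased from include/asm-i386/pgtable.h, not copied exactly):

#define pte_offset_map(dir, address) \
        ((pte_t *)kmap_atomic(pmd_page(*(dir)), KM_PTE0) + pte_index(address))
#define pte_unmap(pte) kunmap_atomic((pte), KM_PTE0)
/* Without CONFIG_HIGHPTE (and on 64-bit, where all RAM is permanently mapped)
   pte_offset_map degenerates to pointer arithmetic and pte_unmap to a no-op,
   but the calls stay paired so the same code works either way. */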

Page table in Linux kernel space during boot

I feel confused about page table management in the Linux kernel.
In Linux kernel space, before the page table is turned on, the kernel runs with a 1:1 mapping between virtual and physical memory. After the page table is turned on, the kernel has to consult the page tables to translate a virtual address into a physical memory address.
Questions are:
At this time, after turning on the page table, is kernel space still 1 GB (from 0xC0000000 - 0xFFFFFFFF)?
And in the page tables of the kernel process, are only page table entries (PTEs) in the range 0xC0000000 - 0xFFFFFFFF mapped? PTEs outside this range will not be mapped, because kernel code never jumps there?
Is the address mapping before and after turning on the page table the same?
E.g. before turning on the page table, the virtual address 0xC00000FF is mapped to the physical address 0x000000FF; after turning on the page table, the above mapping does not change, and the virtual address 0xC00000FF is still mapped to the physical address 0x000000FF. The only difference is that after turning on the page table, the CPU has to consult the page table to translate the virtual address into a physical address, which it did not need to do before.
Is the page table in kernel space global and shared across all processes in the system, including user processes?
Is this mechanism the same on 32-bit x86 and ARM?
The following discussion is based on 32-bit ARM Linux; the kernel source version is 3.9.
All your questions can be addressed if you go through the procedure of setting up the initial page table (which will be overwritten later by the function paging_init) and turning on the MMU.
When the kernel is first launched by the bootloader, the assembly function stext (in arch/arm/kernel/head.S) is the first function to run. Note that the MMU has not been turned on yet at this moment.
Among other things, the important jobs done by this function stext are:
create the initial page table (which will be overwritten later by the function paging_init)
turn on the MMU
jump to the C part of kernel initialization code and carry on
Before delving into your questions, it is beneficial to know:
Before the MMU is turned on, every address issued by the CPU is a physical address
After the MMU is turned on, every address issued by the CPU is a virtual address
A proper page table should be set up before turning on the MMU, otherwise your code will simply "be blown away"
By convention, the Linux kernel uses the higher 1 GB part of the virtual address space and userland uses the lower 3 GB part
Now the tricky parts:
First trick: using position-independent code.
The assembly function stext is linked at address "PAGE_OFFSET + TEXT_OFFSET" (0xCxxxxxxx), which is a virtual address; however, since the MMU has not been turned on yet, the actual address where the assembly function stext runs is "PHYS_OFFSET + TEXT_OFFSET" (the actual value depends on your hardware), which is a physical address.
So here is the thing: the code of function stext "thinks" it is running at an address like 0xCxxxxxxx, but it is actually running at (0x00000000 + some_offset) (say your hardware configures 0x00000000 as the starting point of RAM). So before turning on the MMU, the assembly code needs to be written very carefully to make sure that nothing goes wrong during execution. In fact, a technique called position-independent code (PIC) is used.
To further explain the above, here are several extracted assembly code snippets:
ldr r13, =__mmap_switched   @ the address to jump to after the MMU has been enabled
b __enable_mmu              @ jump to function "__enable_mmu" to turn on the MMU
Note that the above "ldr" instruction is a pseudo instruction which means "get the (virtual) address of function __mmap_switched and put it into r13"
And function __enable_mmu in turn calls function __turn_mmu_on:
(Note that I removed several instructions from function __turn_mmu_on which are essential to the function but not of interest here)
ENTRY(__turn_mmu_on)
    mcr p15, 0, r0, c1, c0, 0   @ write the control register to enable the MMU ====> this is where the MMU is turned on; after this instruction, every address issued by the CPU is a "virtual address" which will be translated by the MMU
    mov r3, r13                 @ r13 stores the (virtual) address to jump to after the MMU has been enabled, which is (0xC0000000 + some_offset)
    mov pc, r3                  @ a long jump
ENDPROC(__turn_mmu_on)
Second trick: identity mapping when setting up the initial page table before turning on the MMU.
More specifically, the same address range where kernel code is running is mapped twice.
The first mapping, as expected, maps the address range 0x00000000 (again, this address depends on hardware config) through (0x00000000 + offset) to 0xCxxxxxxx through (0xCxxxxxxx + offset)
The second mapping, interestingly, maps the address range 0x00000000 through (0x00000000 + offset) to itself (i.e.: 0x00000000 --> (0x00000000 + offset))
Why do that?
Remember that before the MMU is turned on, every address issued by the CPU is a physical address (starting at 0x00000000), and after the MMU is turned on, every address issued by the CPU is a virtual address (starting at 0xC0000000).
Because ARM has a pipelined structure, at the moment the MMU is turned on there are still instructions in ARM's pipeline that use (physical) addresses generated by the CPU before the MMU was turned on! To avoid these instructions getting blown up, an identity mapping has to be set up to cater to them.
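To illustrate the two mappings with a toy model (my own simplification using 1 MB section entries; the real code is the assembly in __create_page_tables, and the flag/offset values below are placeholders, not taken from a specific board):

#include <stdint.h>

#define PHYS_OFFSET   0x00000000u   /* where RAM starts; board specific      */
#define PAGE_OFFSET   0xC0000000u   /* kernel link-time virtual base         */
#define SECTION_FLAGS 0x00000C0Eu   /* example first-level section bits      */
#define KERNEL_MB     8u            /* megabytes covered by the kernel image */

static void create_boot_mappings(uint32_t pgd[4096])   /* 16 KB first-level table */
{
    for (uint32_t mb = 0; mb < KERNEL_MB; mb++) {
        uint32_t pa = PHYS_OFFSET + (mb << 20);
        /* mapping 1: the "real" one, 0xC0000000+x -> RAM */
        pgd[(PAGE_OFFSET + (mb << 20)) >> 20] = pa | SECTION_FLAGS;
        /* mapping 2: identity, pa -> pa, so instructions already in the
           pipeline keep working at the instant the MMU comes on */
        pgd[pa >> 20] = pa | SECTION_FLAGS;
    }
}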
Now returning to your questions:
At this time, after turning on the page table, is kernel space still 1 GB (from 0xC0000000 - 0xFFFFFFFF)?
A: I guess you mean turning on the MMU. The answer is yes, kernel space is 1 GB (actually it also occupies several megabytes below 0xC0000000, but that is not of interest here).
And in the page tables of the kernel process, are only page table entries (PTEs) in the range 0xC0000000 - 0xFFFFFFFF mapped? PTEs outside this range will not be mapped because kernel code never jumps there?
A: The answer to this question is quite complicated, because it involves a lot of details regarding specific kernel configurations.
To fully answer this question, you need to read the part of the kernel source code that sets up the initial page table (assembly function __create_page_tables) and the function that sets up the final page table (C function paging_init).
To put it simply, there are two levels of page tables in ARM; the first-level table is the PGD, which occupies 16 KB. The kernel first zeroes out this PGD during the initialization process and does the initial mapping in the assembly function __create_page_tables, where only a very small portion of the address space is mapped.
After that, the final page table is set up in the function paging_init, and in this function a quite large portion of the address space is mapped. Say you only have 512 MB of RAM: for most common configurations, this 512 MB would be mapped by kernel code section by section (1 section is 1 MB). If your RAM is quite large (say 2 GB), only a portion of it will be directly mapped.
(I will stop here because there are too many details regarding Question 2)
Is the address mapping before and after turning on the page table the same?
A: I think I've already answered this question in my explanation of the "Second trick: identity mapping when setting up the initial page table before turning on the MMU."
4. The page table in kernel space is global and will be shared across all processes in the system, including user processes?
A: Yes and no. Yes because all processes share the same copy (content) of the kernel page table (the higher 1 GB part). No because each process uses its own 16 KB of memory to store the kernel page table (although the content of the page table for the higher 1 GB part is identical for every process).
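The "yes" half can be seen in the ARM pgd allocation code; paraphrased from arch/arm/mm/pgd.c (get_pgd_slow) of roughly that era:

/* A new process gets its own 16 KB first-level table, but the kernel half is
   simply copied from init_mm, so the kernel mappings are identical in every
   process. */
memcpy(new_pgd + USER_PTRS_PER_PGD,
       init_mm.pgd + USER_PTRS_PER_PGD,
       (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));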
5. Is this mechanism the same on 32-bit x86 and ARM?
A: Different architectures use different mechanisms.
When Linux enables the MMU, it is only required that the virtual addresses of the kernel space are mapped. This happens very early in booting. At this point, there is no user space. There is no restriction against the MMU mapping multiple virtual addresses to the same physical address. So, when enabling the MMU, it is simplest to have a virt==phys mapping for the kernel code space in addition to the link==phys mapping, i.e. the 0xC0000000 mapping.
Is the address mapping before and after turning on the page table the same?
If the physical code address is 0xFF and the final link address is 0xC00000FF, then we have a duplicate mapping when turning on the MMU. Both 0xFF and 0xC00000FF map to the same physical page. A simple jmp (jump) or b (branch) will move from one address space to the other. At that point, the virt==phys mapping can be removed, as we are executing at the final destination address.
I think the above should answer points 1 through 3. Basically, the booting page tables are not the final page tables.
4. The page table in kernel space is global and will be shared across all processes in the system, including user processes?
Yes, this is a big win with a VIVT cache and for many other reasons.
5. Is this mechanism the same on 32-bit x86 and ARM?
Of course the underlying mechanics are different. They are different even for different processors within these families: 486 vs P4 vs AMD K6; ARM926 vs Cortex-A5 vs Cortex-A8, etc. However, the semantics are very similar.
See: Bootmem (lwn.net), an article on the early Linux memory phase.
Depending on the version, different memory pools and page table mappings are active during boot. The mappings we are all familiar with do not need to be in place until init runs.

Linux PowerPC Book E reserve pristine RAM through warm reboot

With CONFIG_FSL_BOOKE (P1020 RDB) on 2.6.31 I need to reserve 1 MB of RAM at some fixed location (doesn't matter where) that is pristine, meaning it is not touched by U-Boot or the bootmem allocator, so that the RAM contents survive a warm reboot. With the caveat that I cannot change U-Boot to use CONFIG_PRAM/mem=.
Compiling a relocatable kernel is not an option in arch/powerpc 2.6.31. memmap is not supported in arch/powerpc/kernel/setup_32.c.
Ideally this area should be reserved and not L1 d-cached, so that it can be used to store ramoops from interrupt context.
Is there any way to move _end out to 0x600000 before bootmem to create a hole that is not touched by anyone? That is, to trick the kernel into thinking that _end is farther out?
In vmlinux.lds.S I tried something like:
. = ALIGN(PAGE_SIZE);
_end = . ;
PROVIDE32 (end = .);
Changed to
. = ALIGN(PAGE_SIZE);
_start_unused_ram = .;
. = ALIGN(0x400000);
_end = . ;
PROVIDE32 (end = .);
However, the area between __bss_stop and 0x400000 was overwritten.
The best solution would be to add the area of memory as a reserved region in the device tree.
That way it will be reserved early during boot and should not be touched by the kernel.
