How does GCC provide a breakdown of the memory used in each memory region defined in the linker file using --print-memory-usage?
GCC just forwards --print-memory-usage to the linker, usually ld:
https://sourceware.org/binutils/docs-2.40/ld.html#index-memory-usage
gcc (or g++ for that matter) has no idea about memory usage, and the linker can only report usage of static storage memory, which is usually:
.text: The "program" or code to be executed. This might be located in RAM or in ROM (e.g. Flash) depending on options and architecture.
.rodata: Data in static storage that is read-only and does not need initialization at run-time. This is usually located in non-volatile memory like ROM or Flash; but there are exceptions, one of which is avr-gcc.
.data, .bss and COMMON: Data in RAM that's initialized during start-up by the CRT (C Runtime).
Apart from these common sections, there might be other sections like .init*, .boot, .jumptables etc., which again depend on the application and architecture.
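As a rough illustration of how C objects usually map to these sections, here is a minimal sketch. All names (boot_count, rx_buffer, crc_seed, static_ram_bytes) are made up for this example, and the actual placement depends on the target and linker script.

```c
#include <stdint.h>

/* Initialized, writable data: normally placed in .data (RAM, with the
   initial values copied in at start-up). */
uint32_t boot_count = 3;

/* Zero-initialized data: normally placed in .bss (RAM, cleared at start-up). */
uint8_t rx_buffer[256];

/* Read-only data: normally placed in .rodata (often Flash/ROM). */
const uint16_t crc_seed[4] = { 0x1021, 0x8005, 0xA001, 0xC867 };

/* Code: placed in .text. */
uint32_t static_ram_bytes(void)
{
    /* Only the .data and .bss objects above occupy RAM in this file. */
    return (uint32_t)(sizeof boot_count + sizeof rx_buffer);
}
```

Built with something like gcc -Wl,--print-memory-usage and a linker script that defines MEMORY regions, the linker's report would account for these objects in whichever regions the script assigns their sections to.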
By its very nature, the linker (or assembler or compiler) cannot determine memory usage that unfolds at run-time, which is:
Stack usage: Non-static local variables that cannot be held in registers, alloca, ...
Heap usage: malloc, new and friends.
What the compiler can do for you is -fstack-usage and similar, which generates a text file *.su for each translation unit. The compiler reports stack usage that's known at compile time (static) and unknown stack usage that arises at run-time (dynamic). The functions marked as static use the specified amount of stack space, without counting the usages of non-inlined callees.
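To make the static/dynamic distinction concrete, here is a small sketch with two hypothetical functions (fixed_frame and variable_frame are names invented for this example). Compiled with -fstack-usage, the first would typically be listed as static in the .su file and the second as dynamic, because its frame size depends on a run-time value.

```c
#include <stddef.h>
#include <string.h>

/* Fixed-size frame: the stack usage is known at compile time. */
size_t fixed_frame(const char *s)
{
    char buf[64];
    strncpy(buf, s, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    return strlen(buf);
}

/* Variable-length array: the frame size depends on n, so the compiler
   cannot give a single number for this function. */
size_t variable_frame(size_t n)
{
    unsigned char vla[n ? n : 1];
    memset(vla, 0xAB, sizeof vla);
    return sizeof vla;
}
```

Neither entry includes the stack used by strncpy, strlen or memset if those calls are not inlined, which is why the per-function numbers alone do not give a whole-program bound.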
In order to know the complete stack usage (or a reliable upper bound), the dynamic call graph must be known. Even if it's known, GCC won't do the analysis for you. You will need other, more elaborate tools to work out these metrics, e.g. by abstract interpretation or other means of static analysis.
Notice that data collected at run-time, like dynamic stack usage analysis, only provides a lower bound on memory usage (or execution time, for that matter). However, for sound analysis, as in safety-critical apps, what you need are upper bounds.
I just started diving into the world of operating systems and I've learned that processes have a certain memory space they can address which is handled by the operating system. I don't quite understand how can an Operating System written in high level languages like c and c++ obtain this kind of memory management functionality.
You have caught the bug and there is no cure for it :-)
The language you use to write your OS has very little to do with the way your OS operates. Yes, most people use C/C++, but there are others. As for the language, you do need a language that will let you directly communicate with the hardware you plan to manage, assembly being the main choice for this part. However, this is less than 5% of the whole project.
The code that you write must not rely upon any existing operating system. That is, you must code all of the functions yourself, or call existing libraries. However, these existing libraries must be written so that they don't rely upon anything else.
Once you have a base, you can write your OS in any language you choose, with the minor part in assembly, something a high level language won't allow. In fact, in 64-bit code, some compilers no longer allow inline assembly, so this makes that 5% I mentioned above more like 15%.
Find out what you would like to do and then find out if that can be done in the language of choice. For example, the main operating system components can be written in C, while the actual processor management (interrupts, etc) must be done in assembly. Your boot code must be in assembly as well, at least most of it.
As mentioned in a different post, I have some early example code that you might want to look at. The boot is done in assembly, while the loader code, both Legacy BIOS and EFI, are mostly C code.
To clarify fysnet's answer, the reason you have to use at least a bit of assembly is that you can only explicitly access addressable memory in C/C++ (through pointers), while hardware registers (such as the program counter or stack pointer) often don't have memory addresses. Not only that, but some registers have to be manipulated with CPU architecture-dependent special instructions, and that, too, is only possible in machine language.
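As a small illustration of this point, the sketch below reads the stack pointer, which has no memory address and so cannot be reached through a C pointer. It assumes GCC-style extended inline asm; read_stack_pointer is a hypothetical helper name, and only x86-64 and AArch64 are handled explicitly.

```c
#include <stdint.h>

/* Hypothetical helper: return the current value of the stack pointer. */
uintptr_t read_stack_pointer(void)
{
    uintptr_t sp;
#if defined(__x86_64__)
    /* Copy RSP into a general-purpose register chosen by the compiler. */
    __asm__ volatile ("mov %%rsp, %0" : "=r"(sp));
#elif defined(__aarch64__)
    __asm__ volatile ("mov %0, sp" : "=r"(sp));
#else
    /* Other targets: fall back to a GCC builtin that approximates it. */
    sp = (uintptr_t)__builtin_frame_address(0);
#endif
    return sp;
}
```

The same mechanism is what a kernel would use for privileged instructions, with the difference that those only execute successfully in kernel mode.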
I don't quite understand how can an Operating System written in high level languages like c and c++ obtain this kind of memory management functionality.
As described above, depending on the architecture, this could be achieved by having special instructions to manage the MMU, TLB etc. INVLPG is one example of such an instruction in the x86 architecture. Note that having a special instruction requiring kernel privileges is probably the simplest way to implement such a feature in hardware in a secure manner, because then it is simply sufficient to check if the CPU is in kernel mode in order to determine whether the instruction can be executed or not.
Compilers turn high-level languages into asm / machine code for you, so you don't have to write asm yourself. You pick a compiler that handles memory the way you want your OS to; e.g. using the callstack for automatic storage, and not implicitly calling malloc / free (because those won't exist in your kernel).
To link your compiled C/C++ into a kernel, you typically have to know more about the ABI it targets and about the toolchain, especially the linker.
The ISO C standard treats implementation details very much as a black box. But real compilers that people use for low level stuff work in well-known ways (i.e. make the expected/useful implementation choices) that kernel programmers depend on, in terms of compiling code and static data into contiguous blocks that can be linked into a single kernel executable that can be loaded all as one chunk.
As for actually managing the system's memory, you write code yourself to do that, with a bit of inline asm where necessary for special instructions like invlpg as other answers mention.
The entry point (where execution starts) will normally be written in pure asm, to set up a callstack with the stack pointer register pointing to it.
It will also set up virtual memory and so on, so that code is executable, data is read/write, and read-only data is readable. All of this happens before jumping to any compiled C code. The first C you jump to is probably more kernel init code, e.g. initializing data structures for an allocator to manage all the memory that isn't already in use by static code/data.
Creating a stack and mapping code/data into memory is the kind of setup that's normally done by an OS when starting a user-space program. The asm emitted by a compiler will assume that code, static data, and the stack are all there already.
The Extended Asm manual https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html says the following about the "memory" clobber:
The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters). To ensure memory contains correct values, GCC may need to flush specific register values to memory before executing the asm. Further, the compiler does not assume that any values read from memory before an asm remain unchanged after that asm; it reloads them as needed. Using the "memory" clobber effectively forms a read/write memory barrier for the compiler.
I am confused about the decision to flush to memory. Before the asm code, how would GCC know if a register serves as a cache for a memory location, and thus needs to be flushed to memory? And is this part of cache coherency (I thought cache coherency was a hardware behavior)? After the asm code, how does GCC distinguish a register as a cache and, next time the register is read, decide instead to read from memory as the cache may be old?
Before the asm code, how would GCC know if a register serves as a
cache for a memory location, and thus needs to be flushed to memory?
Because GCC is the one who generates this code.
Generally, from GCC's perspective:
[C code to compile]
[your inline asm with clobber]
[C code to compile]
GCC generates the assembly instructions before and after your inline asm, hence it knows everything before and after it. Now, since the memory clobber acts as a software (compiler) memory barrier, the following applies:
[GCC generated asm]
[compiler memory barrier]
[GCC generated asm]
So GCC generates the assembly before and after the barrier, and it knows that it cannot have memory accesses crossing the memory barrier. Basically, from GCC's point of view, there is code to compile, then a memory barrier, then more code to compile, and that's it. The only restriction the memory barrier imposes is that GCC-generated code must not have memory accesses crossing this barrier.
So if, for example, GCC loads a register with a value from memory, changes it, and stores it back to memory, the load and store cannot cross the barrier. Depending on the code, they must reside before or after the barrier (or appear twice, once on each side).
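A minimal sketch of this behavior, assuming GCC-style inline asm. The names compiler_barrier, shared_counter and bump_twice are hypothetical and only serve the illustration.

```c
/* An empty asm with a "memory" clobber: emits no instruction, but GCC
   may not keep memory values cached in registers across it. */
#define compiler_barrier() __asm__ volatile ("" ::: "memory")

int shared_counter = 40;   /* ordinary global, deliberately not volatile */

int bump_twice(void)
{
    shared_counter += 1;   /* GCC may hold the new value in a register... */
    compiler_barrier();    /* ...but must have stored it to memory by here, */
    shared_counter += 1;   /* and must reload shared_counter after the asm. */
    return shared_counter;
}
```

Without the barrier, GCC would be free to fold the two increments into a single add of 2; with it, the generated code has one load/store pair on each side of the asm. The result is the same either way, only the memory accesses differ.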
I recommend reading this related SO thread.
I've been reading on compiler optimizations vs CPU optimizations, and volatile vs memory barriers.
One thing which isn't clear to me: my current understanding is that CPU optimizations and compiler optimizations are orthogonal, i.e. they can occur independently of each other.
However, the article volatile considered harmful makes the point that volatile should not be used. Linus's post makes similar claims. The main reasoning, IIUC, is that marking a variable as volatile disables all compiler optimizations when accessing that variable (i.e. even if they are not harmful), while still not providing protection against memory reorderings. Essentially, the main point is that it's not the data that should be handled with care, but rather a particular access pattern needs to be handled with care.
Now, the volatile considered harmful article gives the following example of a busy loop waiting for a flag:
while (my_variable != what_i_want) {}
and makes the point that the compiler can optimize the access to my_variable so that it only occurs once and not in a loop. The solution, so the article claims, is the following:
while (my_variable != what_i_want)
cpu_relax();
It is said that cpu_relax acts as a compiler barrier (earlier versions of the article said that it's a memory barrier).
I have several gaps here:
1) Is the implication that gcc has special knowledge of the cpu_relax call, and that it translates to a hint to both the compiler and the CPU?
2) Is the same true for other primitives such as smp_mb() and the like?
3) How does that work, given that cpu_relax is essentially defined as a C macro? If I manually expand cpu_relax will gcc still respect it as a compiler barrier? How can I know which calls are respected by gcc?
4) What is the scope of cpu_relax as far as gcc is concerned? In other words, what's the scope of reads that cannot be optimized by gcc when it sees the cpu_relax instruction? From the CPU's perspective, the scope is wide (memory barriers place a mark in the read or write buffer). I would guess gcc uses a smaller scope - perhaps the C scope?
Yes, gcc has special knowledge of the semantics of cpu_relax or whatever it expands to, and must translate it to something for which the hardware will respect the semantics too.
Yes, any kind of memory fencing primitive needs special respect by the compiler and hardware.
Look at what the macro expands to, e.g. compile with "gcc -E" and examine the output. You'll have to read the compiler documentation to find out the semantics of the primitives.
The scope of a memory fence is as wide as the scope the compiler might move a load or store across. A non-optimizing compiler that never moves loads or stores across a subroutine call might not need to pay much attention to a memory fence that is represented as a subroutine call. An optimizing compiler that does interprocedural optimization across translation units would need to track a memory fence across a much bigger scope.
There are a number of subtle questions related to CPU and SMP concurrency in your questions which will require you to look at the kernel code. Here are some quick ideas to get you started on the research, specifically for the x86 architecture.
The idea is that you are trying to perform a concurrency operation where your kernel task (see kernel source sched.h for struct task_struct) is in a tight loop comparing my_variable with a local variable until it is changed by another kernel task (or changed asynchronously by a hardware device!). This is a common pattern in the kernel.
The kernel has been ported to a number of architectures and each has a specific set of machine instructions to handle concurrency. For x86, cpu_relax maps to the PAUSE machine instruction. It allows an x86 CPU to run a spinlock more efficiently, so that the lock variable update is more readily visible to the spinning CPU. GCC compiles the function/macro just like any other code. If cpu_relax is removed from the loop then gcc CAN consider the loop as non-functional and remove it. Look at the Intel X86 Software Manuals for the PAUSE instruction.
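smp_mb is a kernel macro that, on x86, expands to a full memory fence (MFENCE or a LOCK-prefixed instruction, depending on the kernel version). It does not flush the cache: x86 caches are kept coherent by the hardware. What it does is order memory accesses, so that this CPU's earlier loads and stores become globally visible before any later ones. Without it, a store to my_variable may still be sitting in the CPU's store buffer, not yet visible to other CPUs. Look at the Intel X86 Software Manuals for the MFENCE/LFENCE instructions.
Note that smp_mb() makes the CPU wait until its store buffer has drained, so it CAN be an expensive operation.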
If you expand cpu_relax on an x86, it will show asm volatile("rep; nop" ::: "memory"). The asm volatile keeps GCC from optimizing the statement out, and the "memory" clobber additionally makes it a compiler barrier. Compare the barrier macro, asm volatile("": : : "memory"), which is the compiler barrier alone, with no instruction emitted.
I'm not clear what you mean by "scope of cpu_relax". Some possible ideas: It's the PAUSE machine instruction, similar to ADD or MOV. PAUSE will affect only the current CPU. PAUSE allows for more efficient cache coherency between CPUs.
I just looked at the PAUSE instruction a little more - an additional property is it prevents the CPU from doing out-of-order memory speculation when leaving a tight loop/spinlock. I'm not clear what THAT means but I suppose it could briefly indicate a false value in a variable? Still a lot of questions....
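The busy-wait loop from the question can be approximated in user space as follows. This is only a sketch: my_barrier, my_cpu_relax and spin_until are hypothetical stand-ins, not the kernel's definitions, and the loop is bounded so that it terminates even when nothing changes the flag. Per GCC's documentation, the "memory" clobber is what obliges the compiler to re-read my_variable on each iteration.

```c
/* Hypothetical user-space stand-ins for the kernel macros discussed above. */
#define my_barrier() __asm__ volatile ("" ::: "memory")

#if defined(__x86_64__) || defined(__i386__)
/* "rep; nop" is the encoding of PAUSE; the clobber adds the compiler barrier. */
#define my_cpu_relax() __asm__ volatile ("rep; nop" ::: "memory")
#else
#define my_cpu_relax() my_barrier()
#endif

int my_variable;   /* deliberately not volatile */

/* Spin until my_variable == want or max_spins is reached.
   Returns the number of spins performed. */
int spin_until(int want, int max_spins)
{
    int spins = 0;
    while (my_variable != want && spins < max_spins) {
        my_cpu_relax();   /* my_variable is re-read after this statement */
        spins++;
    }
    return spins;
}
```

Because the macro is expanded before GCC ever sees it, expanding it by hand changes nothing: what GCC respects is the asm statement and its clobber list, not the name cpu_relax.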
This image and others like it have been bothering me for a while now. When I use malloc, this should be a part of the dynamic data, the heap. However, this seems to be bounded from above by the stack, which seems very ineffective. A program cannot predict how much memory I plan on allocating, so how does the program judge how far up to put the stack? It seems as though all of the memory in the middle is wasted, and I would like to know how this works for programs that could potentially range from a small service program that doesn't use any dynamic memory to a videogame that could potentially allocate huge sections of memory.
Say, for example, I open up Microsoft paint. If I paste a high resolution picture into it, the memory allocation of paint skyrockets. Where was this memory taken from? What I would truly like is a snapshot of my entire RAM stick labelled as above to visualize how the many programs of a computer partition the computer's memory as a whole, but I can only find diagrams like this one for a single process and a single section of RAM.
Your picture is not of the RAM, but of the address space of some process in virtual memory. The kernel configures the MMU to manage virtual memory, provide the virtual address space of a process, and do some paging, and manage the page cache.
BTW, it is not the compiler which grows the stack (so your picture is wrong). The compiler is generating machine code which may push or pop things on the call stack. For malloc allocated heap, the C standard library implementation contains the malloc function, above operating system primitives or system calls allocating pages of virtual memory (e.g. mmap(2) on Linux).
On Linux, a process can ask for its address space to be changed with mmap(2) (and munmap and mprotect). When a program is started with execve(2), the kernel sets up its initial address space. See also /proc/ (see proc(5) and try cat /proc/$$/maps). BTW, mmap is often used to implement malloc(3) and dlopen(3) (runtime loading of plugins), both heavily used in the RefPerSys project (a free software artificial intelligence project for Linux).
Since most Linux systems are open source, I suggest you dive into the implementation details by downloading and then looking inside the source code of GNU libc or musl-libc: both implement malloc and dlopen above mmap and other syscalls(2).
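For a feel of the primitive that malloc implementations build on, here is a minimal sketch for Linux. map_touch_unmap is a hypothetical helper name, and the _DEFAULT_SOURCE define is an assumption about glibc's feature-test macros, needed only when compiling in a strict ISO C mode.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map an anonymous region, touch it, unmap it. Returns 0 on success. */
int map_touch_unmap(size_t len)
{
    /* Ask the kernel to add len bytes of zero-filled pages to our
       virtual address space; no file is involved (fd = -1). */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, len);   /* physical pages are provided on first touch */
    int ok = ((unsigned char *)p)[len - 1] == 0x5A;

    /* Give the range back; the addresses become invalid again. */
    if (munmap(p, len) != 0)
        return -1;
    return ok ? 0 : -1;
}
```

Running cat /proc/<pid>/maps between the mmap and munmap calls would show the new region as an extra line in the process's address space.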
Windows has similar facilities, but I don't know Windows. Refer to the documentation of the WinAPI.
Read also Operating Systems: Three Easy Pieces, and, if you code in C, some good book such as Modern C and some C reference website. Be sure to read the documentation of your C compiler (e.g. GCC). See also the OSDEV website.
Be aware that modern C compilers are permitted to make extensive optimizations. Read what every C programmer should know about undefined behavior. See also this draft report.
In modern systems, a technique called virtual memory is used to give the program its own memory space. Heap and stack locations are at specific locations in virtual memory. The kernel then takes care of mapping memory locations between physical memory and virtual memory. Your program may have a chunk of memory allocated at 0x80000000, but that chunk of memory might be stored in location 0x49BA5400. Your actual RAM stick would be a jumbled mess of chunks of all of those sections in seemingly random locations.
I want to reserve/allocate a range of memory in RAM and the same application should not overwrite or use that range of memory for heap/stack storage. How to allocate a range of memory in ram protected from stack/heap overwrite?
I thought about adding (or allocating) an array in the application itself to reserve the memory, but it's optimized out by the compiler as it's not referenced anywhere in the application.
I am using ARM GNU toolchain for compiling.
There are several solutions to this problem. Listed from best to worst:
Use the linker
Annotate the variable
Global scope
Volatile (maybe)
Linker script
You can obviously use a linker file to do this. It is the proper tool for the job. Pass the linker the --verbose parameter to see what the default script is. You may then modify it to precisely reserve the memory.
Variable Attributes
With more recent versions of gcc, the used attribute (__attribute__((used))) will also do what you want. Most modern gcc versions support it. It is also significantly easier than the linker script, but only the linker script reliably gives precise control over the position of the hole.
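A minimal sketch of the attribute approach, assuming GCC or a compatible compiler. reserved_area and the two accessor functions are hypothetical names, and the accessors exist only so the example can be checked; in the real use case nothing would reference the array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reserved block. `used` tells GCC to emit the object even
   if nothing in the program references it; `aligned` is optional. */
static unsigned char reserved_area[4096] __attribute__((used, aligned(16)));

/* Demonstration only: report the size and address of the block. */
size_t reserved_area_size(void)
{
    return sizeof reserved_area;
}

const void *reserved_area_addr(void)
{
    return reserved_area;
}
```

This keeps the compiler from dropping the array, but where it lands in RAM is still decided by the linker, so it gives no control over the position of the block.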
Global scope
You may also give your array global scope and the compiler should not eliminate it. This may not be true if you use link time optimization.
Volatile
Theoretically, a compiler may eliminate a static volatile array. The volatile only comes into play when you have code involving the array: it modifies the access behavior so that the compiler never caches accesses to that range (see Dr. Dobbs on volatile). The behavior is unclear to me, at least, and I would not recommend this method. It may work with some versions (and optimization levels) of the compiler and not others.
Limitations
Also, the linker option --gc-sections can eliminate space reserved with either the global scope or the volatile method, as the symbol may not be annotated in any way in the object format; see KEEP in the linker script.
Only the linker script can definitely prevent overwrites by the stack. You need to position the top of the stack before your reserved area. Typically, the heap grows up and the stack grows down, so these two collide with each other. This is particular to your environment/C library (for instance, newlib is the typical ARM bare metal library). Looking at the linker file will give the best clue to this.
My guess is you want a fallow area to reserve for some sort of debugging information in the event of a system crash? A more explicit explanation of your problem would be helpful. You don't seem to be concerned with the position of the memory, so I guess this is not hardware related.