Why do we have memory zones in linux? - linux-kernel

I was reading this on a page:
Because of hardware limitations, the kernel cannot treat all pages as identical. Some pages, because of their physical address in memory, cannot be used for certain tasks. Because of this limitation, the kernel divides pages into different zones.
I was wondering about those hardware limitations. Can somebody please explain those limitations to me and give an example? Also, is there any software guide from Intel explaining this?
I also read that virtual memory is divided into two parts: 1 GB for kernel space and 3 GB for user space. Why do we give 1 GB of the virtual address space of every process to the kernel? How is it mapped to actual physical pages? Can somebody please point me to a clear text explaining this?
Thanks in advance.

The hardware limitations mostly concern old devices. For example, there is ZONE_DMA, which covers physical memory from 0 to 16 MB. It is needed, e.g., for older ISA devices, which are not capable of addressing anything above the 16 MB limit. Then you have ZONE_NORMAL, where most kernel operations take place and which is permanently mapped into the kernel's address space.
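To make the ZONE_DMA point concrete, here is a minimal kernel-module sketch (module and symbol names are made up, and a real driver would more likely use the DMA API such as dma_alloc_coherent): passing GFP_DMA asks the allocator to satisfy the request from ZONE_DMA, i.e. physical memory below 16 MB on x86, so an old ISA-style device could actually reach the buffer.

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/io.h>   /* virt_to_phys() */

static void *isa_buf;

static int __init zone_demo_init(void)
{
	/* GFP_DMA restricts the allocation to ZONE_DMA (below 16 MB on x86). */
	isa_buf = kmalloc(4096, GFP_KERNEL | GFP_DMA);
	if (!isa_buf)
		return -ENOMEM;

	pr_info("zone_demo: buffer at physical address 0x%llx\n",
		(unsigned long long)virt_to_phys(isa_buf));
	return 0;
}

static void __exit zone_demo_exit(void)
{
	kfree(isa_buf);
}

module_init(zone_demo_init);
module_exit(zone_demo_exit);
MODULE_LICENSE("GPL");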
The 1 GB / 3 GB split is simple. You are dealing with virtual addresses here, so for your application the address space always starts at 0x00000000, and the top 1 GB of it is reserved for kernel stuff. Why this is done is pretty simple: you have kernel mode and user mode, and in kernel mode you are allowed to use system calls. If the kernel memory were not mapped into your virtual address space, trapping into kernel mode would require a full address-space switch (save the current context to memory, load another context from memory -> time consuming). But because kernel-mode operations can take place in the same virtual address space, you don't need to switch address spaces to, for example, allocate new memory or perform any other system call.

For your second question, about the 1 GB kernel mapping in every process's address space:
The kernel is mapped there, of course, to save time by avoiding an address-space switch on every system call. The 1 GB is reserved for kernel functionality, so that if the kernel needs to map new memory for its own use it can do so. Any book on Unix internals can give you the details.

Related

Can one remap kernel virtual address for use by kernel code

I am porting a large application to ARM32 Linux and splitting off the hardware stuff into a device driver. Nearly all of the extensive driver code uses absolute addresses to access buffers and I/O related variables and registers. I'd have to change all of that to pointer-relative addresses - and a lot of the code is in assembler as well.
From user space it is simple to use mmap to ask for a target virtual address for physical memory (via /dev/mem) so that side poses no issue.
But how can I do something similar in kernel code? ioremap and memremap give you an arbitrary kernel virtual address; worse, loading a driver with insmod places both code and data (.bss) in vmalloc memory.
remap_pfn_range can be used to map kernel memory into user space via the mmap call (and with that, ask for a given virtual address range) - but how can that be used from the kernel itself, if at all?
So, for example, say I have a buffer at physical address 0x60000000 - how can I tell the kernel to map that to a given kernel-accessible virtual address (perhaps also 0x60000000, but it could be anything as long as it's known at compile time)?
So far I have spent days reading anything that mentions remapping but am not finding the "golden" answer. Anybody know if one exists?
AFAIK there is no "easy" way to do that.
This document explains the memory layout of the Linux kernel, and as you see, modules have a specific mapping region which can't be changed as long as you load your code with the init_module syscall, and dynamic memory allocated with things like kmalloc also has a specific range.
Maybe you'll be able to hack something together to create a buffer at a known address, but if my memory doesn't fail me, Linux kind of depends on the layout mentioned above for some fundamental stuff (page faults, etc.).
OK, I have the answer and it's embarrassingly simple.
In my case I am running an STM32MP157 chip under Buildroot. It so happens that the 512 MB of DRAM is placed at physical address 0xC0000000. This means the kernel-space (lowmem) virtual address equals the physical address: PAGE_OFFSET and PHYS_OFFSET are both 0xC0000000, so they simply cancel out.
Right, to display a nice logo on startup, a 3 MB framebuffer is allocated in CMA memory, which starts at 0xD8000000. This is done early in kernel init and is the first thing in CMA. Later on I allocate more framebuffers via DRM, but the first one stays.
It's unused after kernel boot - except it now isn't. It's my perfect solution: just read and write directly into 0xD8000000 to 0xD83FFFFF (the physical location and size of that framebuffer). All the variables I need to have at locations known at compile time are placed into that space. Directly accessible, no pointers needed. No need to modify my existing code other than telling the linker to place the variables at 0xD8000000.
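For reference, a minimal sketch of what this could look like in code (all names are invented; it only works because of the two board-specific facts above: the CMA framebuffer stays reserved, and PAGE_OFFSET == PHYS_OFFSET, so lowmem virtual addresses equal physical addresses on this machine):

```c
/* Board-specific: on this machine lowmem virtual == physical. */
#define FIXED_BASE  0xD8000000UL   /* start of the reserved CMA framebuffer */

struct fixed_vars {                /* variables that must sit at a known address */
	volatile unsigned int status;
	volatile unsigned int sample_count;
	unsigned char buffer[4096];
};

/* Option 1: access the constant address directly through a pointer. */
static inline struct fixed_vars *fixed(void)
{
	return (struct fixed_vars *)FIXED_BASE;
}

/* Option 2 (the approach described above): place the variables there at
 * link time, e.g. with a dedicated section plus a linker option such as
 *   --section-start=.fixed_vars=0xD8000000
 */
__attribute__((section(".fixed_vars")))
struct fixed_vars vars;
```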

How come the allocation of virtual address spaces doesn't rob you of all virtual memory?

On a 32-bit computer, a virtual memory address is represented as an integer between 0 and 2^32 − 1. By virtue of being a 32-bit system, no address can be represented that's lower than 0 or higher than 2^32 − 1, and we therefore have a total of 4 GiB (2^32 bytes) of virtual memory to use. We also know that address spaces are memory protected; they cannot touch, because otherwise one process would be able to “step on the toes” of another. So, if all that I've said is correct, let me now ask this: if we grant that, by Microsoft's own documentation, 2 GiB of virtual address space are used by the system and 2 GiB of virtual address space are provided to a single user-mode process, have we not exhausted every possible virtual memory address on a 32-bit system? Would this not mean that we have to resort to disk swapping just to run 2 processes? Surely, this is too ridiculous to be true, and I just want someone to clarify where my thinking has gone astray...
I have looked at the following questions but none of them seem to give satisfying/consistent/not hand-wavy answers. Or maybe I just don't understand them:
What is the maximum addressable space of virtual memory? - Stack Overflow
Virtual address space in windows - Stack Overflow
What happens when the number of possible virtual addresses are exceeded - Stack Overflow
Thanks! :)
TL;DR: Virtual memory is per-process and the address space changes when the OS switches execution from one process to another.
Remember that we are talking about virtual memory here and virtual memory is a trick that works because of the cooperation between the OS and the CPU hardware.
The split is not always at 2 GB (there is a boot switch for 3 GB, etc.), but for this discussion let's assume it is always 2 GB and that we are on an x86 machine.
When the CPU needs to access an address in virtual memory, it needs to translate the virtual page to a physical one. The exact mechanics of how this works is too big a topic to cover here, but suffice it to say that the translation involves a page directory that holds information about present/swapped, modified, COW, etc., and a way to map the address to physical RAM (and if the page is not present, the CPU will ask the OS to swap it in from the page file).
The upper 2 GB is where the kernel and drivers live, and that mapping is the same in all processes (but it can only be accessed in kernel mode, CPU ring 0). The lower 2 GB, however, is per-process: each process has its own set of mappings. When the OS switches execution from one process to another (context switch), the page directory for the CPU the thread is about to run on is changed. This way each process has its own virtual address space.
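A small user-space experiment (not part of the original answer) that makes the per-process nature of virtual addresses visible: after fork(), parent and child print the same virtual address but see different contents once the child writes to it, because that write gives the child its own physical page.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	int *value = malloc(sizeof *value);
	*value = 1;

	pid_t pid = fork();
	if (pid == 0) {                /* child */
		*value = 42;           /* the write gives the child its own physical page */
		printf("child:  addr=%p value=%d\n", (void *)value, *value);
		return 0;
	}
	wait(NULL);
	printf("parent: addr=%p value=%d\n", (void *)value, *value);
	/* Same address printed twice, different contents: the virtual-to-physical
	 * mapping is private to each process. */
	return 0;
}
```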

Linux memory mapping

I've got a few questions about Linux memory management (assume an x86 32-bit platform).
By default, for all processes the top 1 GB of virtual addresses is mapped to the kernel area. Theoretically, the kernel can map additional memory from high memory using vmalloc. My question is: what happens with the page tables of all the user processes? I assume they should be updated with the kernel's new mappings? (Maybe that memory will get used when the kernel is in process context.)
Can someone explain where the x86 logical address mapping limitation comes from? In "Linux Device Drivers", chapter 15, it is said that there is a limitation on mapping logical addresses, but with no deep explanation:
in many cases, even 32-bit processors can address more than 4 GB of physical memory. The limitation on how much memory can be directly mapped with logical addresses remains, however. Only the lowest portion of memory (up to 1 or 2 GB, depending on the hardware and the kernel configuration) has logical addresses; the rest (high memory) does not.
When does the kernel switch to its own page table (not including boot time)? When it's in process context and in interrupt context, it uses the user-mode process's page table. Kernel threads use a process page table as well.
1.) There is only one set of 256 page tables that map the kernel's 1 GiB region. The top 256 entries of each user-space page directory point to these page tables. Thus, if the kernel changes a virtual mapping, all user-space processes get the update as well (see the sketch after this answer).
2.) I'm not sure which limitation you mean; can you quote some text so I can find the passage in the book?
3.) When a process like QEMU starts a virtual CPU with KVM, the kernel swaps out the page table of the process, even though it doesn't yield to a different process. There may be more places like this, but in general, I don't think there is such a thing as a "kernel page table". All process page tables already map kernel memory, and it would thus seem wasteful to switch them out.
"Linux Device Drivers" is a great reference, but I can also recommend "Understanding the Linux Virtual Memory Manager", and of course, the kernel's source code.

Does the last GB in the address space of a linux process map to the same physical memory?

I read that the first 3 GB are reserved for the process and the last GB is for the kernel. I also read that the kernel is loaded starting from the 2nd MB of the physical address space (depending on the configuration). My question is: is the mapping of that last 1 GB the same for all processes, and does it map to this physical area of memory?
Another question: when a process switches to kernel mode (e.g. when a syscall occurs), which page tables are used, the process page tables or the kernel page tables? If kernel page tables are used, then they can't access the memory locations belonging to the process. If that is the case, then there is apparently no use for kernel virtual memory, since all access to kernel code and data would go through the mapping of the last 1 GB of the process address space. Please help me clarify this (any useful links will be much appreciated).
It seems you are talking about 32-bit x86 systems, right?
If I am not mistaken, the kernel can be configured not only for a 3 GB/1 GB memory split; there can be other variants (e.g. 2 GB/2 GB). Still, 3 GB/1 GB is probably the most common one on x86-32.
The kernel part of the address space should be inaccessible from user space. From the kernel's point of view, yes, the mapping of the memory occupied by the kernel itself is always the same, no matter in the context of which process (or interrupt handler, or whatever else) the kernel currently operates.
As one of the consequences, if you look at the addresses of kernel symbols in /proc/kallsyms from different processes, you will see the same addresses each time. And these are exactly the addresses of the respective kernel functions, variables and other symbols from the kernel's point of view.
So I suppose the answer to your first question is "yes", but it is probably not very useful for user-space code, as kernel-space memory is not directly accessible from there anyway.
As for the second question: if the kernel currently operates in the context of some process, it can actually access the user-space memory of that process. I can't describe it in detail, but the implementations of the kernel functions copy_from_user and copy_to_user could give you some hints. See arch/x86/lib/usercopy_32.c and arch/x86/include/asm/uaccess.h in the kernel sources. It seems that on x86-32 the user-space memory is accessed in these functions directly, using the default memory mappings for the current process context. The 'magic' stuff there is only related to optimizations and to checking the address of the memory area for correctness.
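As an illustration of that point (names invented; a simplified read handler rather than the actual usercopy code): a driver running in process context can hand data to the calling process with copy_to_user(), which goes through the current process's own mappings and fails cleanly if the user-space address is bad.

```c
#include <linux/fs.h>
#include <linux/uaccess.h>

static char kernel_buf[] = "hello from kernel space\n";

static ssize_t demo_read(struct file *filp, char __user *ubuf,
                         size_t count, loff_t *ppos)
{
	size_t avail;

	if (*ppos >= sizeof(kernel_buf))
		return 0;                        /* nothing left to read */
	avail = sizeof(kernel_buf) - *ppos;
	if (count > avail)
		count = avail;

	/* Copies through the calling process's own page tables; returns
	 * nonzero if the user-space address is invalid or unmapped. */
	if (copy_to_user(ubuf, kernel_buf + *ppos, count))
		return -EFAULT;

	*ppos += count;
	return count;
}
```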
Yes, the mapping of the kernel part of the address space is the same in all processes. Part of it does map that part of the physical memory where the kernel image is loaded, but that's not the bulk of it - the remainder is used to map other physical memory locations for the kernel's runtime working set.
When a process switches to kernel mode, the page tables are not changed. The kernel part of the address space simply becomes accessible because the CPL (Current Privilege Level) is now zero.

Difference between physical addressing and virtual addressing concept

This is a re-submission, because I am not getting any response from superuser.com. Sorry for the misunderstanding.
I need to know the difference between the physical addressing and virtual addressing concepts in embedded systems.
Why is the virtual addressing concept implemented in embedded systems?
What is the advantage of virtual addressing over a system that uses physical addressing in embedded systems?
How is the mapping between virtual and physical addresses done in embedded systems?
Please explain the above concepts with some simple examples on a simple architecture.
Physical addressing means that your program actually knows the real layout of RAM. When you access a variable at address 0x8746b3, that's where it's really stored in the physical RAM chips.
With virtual addressing, all application memory accesses go through a page table, which maps each virtual address to a physical one. So every application has its own "private" address space, and no program can read or write another program's memory. This separation is what paging-based memory protection provides.
Virtual addressing has many benefits. It protects programs from crashing each other through poor pointer manipulation, etc. Because each program has its own distinct virtual address space, no program can read another's data - this is both a safety and a security plus. Virtual memory also enables demand paging and swapping, where pages of a program's memory can be kept on disk (or, nowadays, slower flash) when not in use and brought back when the program accesses them again. Also, without virtual addressing, since only one program can occupy a particular physical page, either a) all programs must be compiled to load at different memory addresses, b) every program must use position-independent code, or c) some sets of programs cannot run simultaneously.
The virtual-to-physical mapping may be done in software (with hardware support for memory traps) or purely in hardware. Sometimes even the page tables themselves live in a special set of hardware memory. I don't know off the top of my head which embedded system does what, but every desktop has a hardware TLB (Translation Lookaside Buffer, basically a cache for the virtual-to-physical mappings), and some now have advanced memory management units that help with virtual machines and the like.
The only downsides of virtual memory are added complexity in the hardware implementation and slower performance.
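If you want to see the virtual/physical distinction for yourself on Linux, here is a small user-space sketch (it usually needs root, since the kernel hides physical frame numbers from unprivileged readers) that translates one of the program's own virtual addresses to a physical address via /proc/self/pagemap.

```c
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
	int value = 123;                       /* any page-backed variable will do */
	uintptr_t vaddr = (uintptr_t)&value;
	long page_size = sysconf(_SC_PAGESIZE);

	FILE *f = fopen("/proc/self/pagemap", "rb");
	if (!f) { perror("pagemap"); return 1; }

	/* One 64-bit entry per virtual page. */
	if (fseek(f, (long)(vaddr / page_size) * 8, SEEK_SET) != 0) {
		perror("seek");
		return 1;
	}

	uint64_t entry;
	if (fread(&entry, sizeof entry, 1, f) != 1) { perror("read"); return 1; }
	fclose(f);

	uint64_t pfn = entry & ((1ULL << 55) - 1);   /* bits 0-54: page frame number */
	if ((entry & (1ULL << 63)) && pfn != 0)      /* bit 63: page present; pfn 0 means hidden */
		printf("virtual %p -> physical 0x%llx\n", (void *)vaddr,
		       (unsigned long long)(pfn * page_size + vaddr % page_size));
	else
		printf("page not present or PFN hidden (try running as root)\n");
	return 0;
}
```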
The VAX (Virtual Address eXtension, by Digital Equipment Corp, which became Compaq, which became HP) is a very good example of a hardware virtual-memory system. It was a 32-bit minicomputer with an OS called VMS, or Virtual Memory System. Dave Cutler was one of the principal architects of the system, and he much later wrote the kernel for Windows NT. He is a very good read on this and other topics. The VAX had special hardware for controlling the virtual address space and for restricting opcode access for security - very secure. This system was, or is, the grandfather of the modern-day PC at the kernel level. The first BSOD I saw on NT 3.51 I was able to read because it came from the crash-dump mechanism used in VMS to stop the system when it became unstable. By the way, look at the names VMS and WNT: take the next letter of the alphabet for each letter of VMS and you get WNT. This was not an accident - maybe a jab at DEC for letting him go.
