Cost of kernel's PAE option - linux-kernel

I am making a kernel(4.4) for 32bit(above i686) processors.
I guess almost 32bit PC has under 4GB RAM, but some PC has more than 4GB RAM.
So I consider about PAE kernel option.
But there is a cost.
like below...
CONFIG_X86_PAE: PAE is required for NX support, and furthermore enables larger swapspace support for non-overcommit purposes. It has the cost of more pagetable lookup overhead, and also consumes more pagetable space per process.
I can understand roughly.
But I can't feel excatly.
How big are the overhead and the pagetable space?

PAE is useful on x86 CPUs which are 32-bit but actually have an address bus of > 32 bit say like 40-bit. So they can actually access more memory. Now to deal with all this extra space the kernel has a larger page table hierarchy. So extra lookup and hence extra cost.

Related

Why 64 bit mode ( Long mode ) doesn't use segment registers?

I'm a beginner level of student :) I'm studying about intel architecture,
and I'm studying a memory management such as a segmentation and paging.
I'm reading Intel's manual and it's pretty nice to understand intel's architectures.
However I'm still curious about something fundamental.
Why in the 64bit long mode, all segment registers are going to bit 0?
Why system doesn't use segment registers any longer?
Because system's 64bit of size (such as a GP registers) are enough to contain those logical address at once?
Is protection working properly in 64bit mode?
I tried to find 64bit addressing but I couldn't find in Google. Perhaps I have terrible searching skill or I may need some specfied previous knowledge to searching in google.
Hence I'd like to know why 16bit of segment registers are not going to use in 64bit mode,
and how could protection work properly in 64bit mode.
Thank you!
In a manner of speaking, when you perform array ("indexed") type addressing with general registers, you are doing essentially the same thing as the segment registers. In the bad old days of 8-bit and 16-bit programming, many applications required much more data (and occasionally more code) than a 16-bit address could reach.
So many CPUs solved this by having a larger addressable memory space than the 16-bit addresses could reach, and made those regions of memory accessible by means of "segment registers" or similar. A program would set the address in a "segment register" to an address above the (65536 byte) 16-bit address space. Then when certain instructions were executed, they would add the instruction specified address to the appropriate (or specified) "segment register" to read data (or code) beyond the range of 16-bit addresses or 16-bit offsets.
However, the situation today is opposite!
How so? Today, a 64-bit CPU can address more than (not less than) all addressable memory space. Most 64-bit CPUs today can address something like 40-bits to 48-bits of physical memory. True, there is nothing to stop them from addressing a full 64-bit memory space, but they know nobody (but the NSA) can afford that much RAM, and besides, hanging that much RAM on the CPU bus would load it down with capacitance, and slow down ALL memory accesses outside the CPU chip.
Therefore, the current generation of mainstream CPUs can address 40-bits to 48-bits of memory space, which is more than 99.999% of the market would ever imagine reaching. Note that 32-bits is 4-gigabytes (which some people do exceed today by a factor of 2, 4, 8, 16), but even 40-bits can address 256 * 4GB == 1024GB == 1TB. While 64GB of RAM is reasonable today, and perhaps even 256GB in extreme cases, 1024GB just isn't necessary except for perhaps 0.001% of applications, and is unaffordable to boot.
And if you are in that 0.001% category, just buy one of the CPUs that address 48-bits of physical memory, and you're talking 256TB... which is currently impractical because it would load down the memory bus with vastly too much capacitance (maybe even to the point the memory bus would stop completely stop working).
The point is this. When your normal addressing modes with normal 64-bit registers can already address vastly more memory than your computer can contain, the conventional reason to add segment registers vanishes.
This doesn't mean people could not find useful purposes for segment registers in 64-bit CPUs. They could. Several possibilities are evident. However, with 64-bit general registers and 64-bit address space, there is nothing that general registers could not do that segment registers can. And general purpose registers have a great many purposes, which segment registers do not. Therefore, if anyone was planning to add more registers to a modern 64-bit CPU, they would add general purpose registers (which can do "anything") rather than add very limited purpose "segment registers".
And indeed they have. As you may have noticed, AMD and Intel keep adding more [sorta] general-purpose registers to the SIMD register-file, and AMD doubled the number of [truly] general purpose registers when they designed their 64-bit x86_64 CPUs (which Intel copied).
Most answers to questions on irrelevance of segment registers in a 32/64 bit world always centers around memory addressing. We all agree that the primary purpose of segment registers was to get around address space limitation in a 16 bit DOS world. However, from a security capability perspective segment registers provide 4 rings of address space isolation, which is not available if we do 64 bit long mode, say for a 64 bit OS. This is not a problem with current popular OS's such as Windows and Linux that use only ring 0 and ring 3 with two levels of isolation. Ring 1 and 2 are sometimes part of the kernel and sometimes part of user space depending on how the code is written. With the advent of hardware virtualization (as opposed to OS virtualization) from isolation perspective, hypervisors did not quite fit in either in ring 0 or ring 1/2/3. Intel and AMD added additional instructions (e.g., INTEL VMX) for root and non-root operations of VM's.
So what is the point being made? If one is designing a new secure OS with 4 rings of isolation then we run in to problems if segmentation is disabled. As an example, we use one ring each for hardware mux code, hypervisor code /containers/VM, OS Kernel and User Space. So we can make a case for leveraging additional security afforded by segmentation based on requirements stated above. However, Intel/AMD still allow F and G segment registers to have non-zero value (i.e., segmentation is not disabled). To best of my knowledge no OS exploits this ray of hope to write more secure OS/Hypervisor for hardware virtualization.

Is contiguous memory easier to get in a 64-bit address space? If so why?

A comment in this blog states:
We know how to make chunked heaps, but there would be some overhead to
using them. We have more requests for faster storage management than
we do for larger heaps in the 32-bit JVM. If you really want large
heaps, switch to the 64-bit JVM. We still need contiguous memory,
but it's much easier to get in a 64-bit address space.
This implication of the above statement is that it is easier to get contiguous memory in a 64-bit address space. Is this true? If so why?
That's very true. A process must allocate memory from the virtual memory address space. Which stores both code and data and whose size is restricted by the addressing capability of the architecture. You can never address more than 2^32 bytes in a 32-bit process, not counting bank-switching tricks. That's 4 gigabytes. The operating system typically takes a big chunk out of that as well, on 32-bit Windows for example that cuts down the addressable VM size to 2 gigabytes.
Ideally, allocations are made so that they fit snugly together. That very rarely works out in practice. Shared libraries or DLLs in particular need to pick a preferred load address and that has to be guessed up front when the library is built.
So in practice, the allocations are made from the holes in between existing ones and the largest possible contiguous allocation you can get is restricted by the size of the largest hole. Usually much smaller than the addressable VM size, on Windows it is typically around 650 megabytes. That tends to go down-hill from there as the available address space is getting fragmented by allocations. Particularly by native code that can't afford to have allocations moved by a compacting garbage collector. If you use Windows then you can get insight in the VM allocations with the SysInternals' VMMap utility.
This problem completely disappears in a 64-bit process. The theoretical addressable virtual memory size is 2^64, an enormous number. So large that current processors don't implement it, they can go up to 2^48. Further restricted by the operating system version you have and its willingness to keep page mapping tables for that much VM. Eight terabytes is a typical limit. By implication, the holes between allocations are huge. Your program will keel over on paging file thrashing before it dies from OOM.
I can't speak for how the JVM is implemented obviously, but from a purely theoretical viewpoint, if you have a significantly larger virtual address space (eg 64-bit as compared with 32-bit) it should be significantly easier to find a large block of contiguous memory which is available for allocation (going to extremes - you've got no chance of finding a contiguous 4GB of free memory in a 32-bit address space, but a significant chance of finding this space in a full 64-bit address space).
It should be noted that whatever the virtual address space size, this is still going to be implemented by allocation of (probably) non-contiguous physical memory pages, particularly if the requested allocation is large - the larger virtual address space just means there are a likely to be a lot more contiguous virtual addresses available for use.

Kernel memory address space

I've read that, on a 32-bit system with 4GB system memory, 2GB is allocated to user mode and 2GB allocated to kernel mode. But, If I had a system with 512 MB of memory, would it be partitioned as 256 MB to user and 256 MB to kernel address space?
You are confusing physical and virtual memory. 2GB is allocated to user/system, but it is virtual memory. It is even more correct to say that they are not rather allocated but they constitute an addressing space. Initially this space is not bound to physical memory at all. When application actually needs memory (first time is at start up) physical memory is allocated and some addresses from address space are mapped to it. When memory is allocated but not used long enough or PC is running out of physical memory data can be dumped in swap file, and stay there until requested. This mapping is transparent for application and it has no idea where data currently is: on chip or on HDD. So the address space is always splitted the same way.
This is not about memory (physical or virtual), but about address space.
You can plug 16GB of physical memory into your computer and make a 100GB swapfile, but 32-bit (non-enterprise) Windows will still only see 4GB (and subtract 0.75 GB for GPU memory and such). Via PAE, it could use more, but non-enterprise versions won't do that.
On top of the actual amount of memory, there is address space, which is limited to 4GB as well. Basically it is no more and no less than the collection of "numbers" (which, in this case, are addresses) that can be represented by a 32 bit number.
Since the kernel will need memory too, there is some arbitrary line drawn, which happens to be at the 2GB boundary for 32bit Windows, but can be configured differently, too.
It has nothing to do with the amount of memory on your computer (virtual or phsyical), but it is a limiting factor of how much memory you can use within a single program instance. It is not, however, a limiting factor on the memory that several programs could use.
As far as I can tell, what you are referring to are limits of how much memory can be allocated. This is much different than how much memory the OS allocated during runtime.

user-kernel address division

In ARM linux, the user-kernel virtual address range is divided in the ratio 3:1.
But in MIPS linux, this is usually 2:2
Does someone know what motivates this design difference ?
I have a faint idea that this has something to do with the fact that in MIPS, the TLB refill is managed in s/w and the kernel TLB entries are kind of hard-wired ensuring that they will never suffer a TLB miss.
This is a limitation of the MIPS 32-bit architecture. User mode is limited to 2GB on most MIPS CPUs.
Only the lower 2GB virtual addresses (0x0000_00000 to 0x7fff_ffff) are accessible in user mode. This part of the address space is called kuseg. Kuseg addresses are translated by the TLB. Whether TLB refills are done in software is irrelevant.
The kernel lives in the 512MB virtual space that extends from 0x8000_0000 to 9fff_ffff. This part of the address space is called kseg0. Kseg0 addresses are not translated by the TLB. These addresses are translated by removing the MSB (i.e. the virtual address range 0x8000_0000-9fff_ffff is hardwired to the physical address range 0x0000_0000-0x1fff_0000)
Refer to a MIPS manual for more details.

Extend the witdh of base address in 386 segment selector to excess the 4GB RAM limit in 32-bit OS?

As the memory requirement grows fast, today more and more system requires 64-bit machines to access even larger RAM.
FWIK in 386 protected mode, a memory pointer consists of two part: the base address (32-bit) specified by a segment selector, and the offset address (32-bit) added to the base address.
To re-compile all programs in 64-bit there's a lot of work to do, for example for C/C++ programs, the machine-dependent `int' type (which is 32-bit in 32-bit machine, and 64-bit in 64-bit machine) will cause problems if it's not used correctly. Even it's being rebuilt with no-problem, as the memory requirement continuous grow, for example someday we'll use 128-bit machines, and do we need to rebuild all the programs again to conform the new word size?
If we just extend the base address to 64-bit, thus make a segment like a 4GB window on the entire RAM, we don't even need a 64-bit OS at all, isn't it? Most of applications/processes won't have to access 4G+ memory, at server side, for example if a file server utilizes 20GB RAM for caching purpose, it may be split into 10 processes with each access 2GB, thus a 32-bit pointer is enough. And put each in different segment to cover 20GB memory.
Extend the segment limit is transparent to upper layer programs, what should be done is only about CPU and the OS, if we can let Linux to support to allocate memory on different 64-bit segments (though currently the segment base address is 32-bit yet), we can easily utilize 1TB RAM on 32-bit machine, isn't it?
Am I right?
The memory access is done on CPU, using assembly instructions. If the CPU have 32 bit for addressing a memory segment, it can address up to 4 GB, but no more. To extend this behavior, the CPU needs a 64 bit register.
A 32 bit OS has the same limitation. A 64 bit OS can execute 32 bit programs and make them access a base address higher than 4 GB, but needs a 64 bit processor.
As conclusion, the limit of the memory window accessible by the OS (and indirectly by the process running on that OS) are limited by the processor register width, in bits.
So, you are not right.
Propably the PAE fits your needs, but you need the hardware and the operative system support, which is very common as far as I know.
You can get exactly this effect today by running 32 bit processes on a 64 bit kernel. Each 32 bit process only has a 4GB virtual address space, but those addresses can be mapped anywhere in the physical memory accessible to the kernel. It's not done using segmentation, though; it's just done through paging.

Resources