What is the best way to reserve a very large virtual address space (TBs) in the kernel?

What is the best way to reserve a very large virtual address space (TBs) in the kernel? - memory-management

I'm trying to manually update TLB to translate new virtual memory pages into a certain set of physical pages in the kernel space. I want to do this in do_page_fault so that whenever a load/store instruction happens in a particular virtual address range (not already assigned), it put a page table entry in TLB in advance. The translation is simple. For example, I would like the following piece of code work properly:
int d;
int *pd = (int*)(&d + 0xffff000000000000);
*pd = 25; // A page fault occurs and TLB is updated
printk("d = %d\n", *pd); // No page fault (TLB is already up-to-date)
So, the translation is just a subtraction by 0xffff000000000000. I was wondering what is the best way to implement the TLB update functionality?
Edit: The main reason for doing that is to be able to reserve a large virtual memory space in the kernel. I just want to handle page faults in a particular address range in the kernel. So, first I have to reserve the address range (maybe exceeds the 100TB limitation). Then, I have to implement a page cache for that particular range. If it is not possible, what is the best solution for that?

Related

Allocate swappable memory in linux kernel

Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?

You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(filp)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);

Answer to your question is a simple No, or Yes with a complex modification to kernel source.
First, to enable swapping out, you have to ask yourself what is happening when kswapd is swapping out. Essentially it will walk through all the processes and make a decision whether its memory can be swapped out or not. And all these memory have the hardware mode of ring 3. So SMAP essentially forbid it from being read as data or executed as program in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
And check your distros "CONFIG_X86_SMAP", for mine Ubuntu it is default to "y" which is the case for past few years.
But if you keep your memory as a kernel address (ring 0), then you may need to consider changing the kswapd operation to trigger swapout of kernel addresses. Whick kernel addresses to walk first? And what if the address is part of the kswapd's kernel operation? The complexities involved is huge.
And next is to consider the swap in operation: When the memory read is attempted and it's "not present" bit is enabled, then hardware exception will trigger linux kernel memory fault handler (which is __do_page_fault()).
And looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and there after how it handler the kernel addresses (do_kern_address_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
which essentially is just reporting as error for possible scenario. If you want to enable kernel address pagefaulting, then this path has to be modified.
And note too that the SMAP check (inside smap_violation) is done in the user address pagefaulting (do_usr_addr_fault()).

Catching and avoiding memory corruption at fixed offset in physical memory

We have a 4-byte memory corruption that always occurs at a fixed offset in the physical memory.
The physical frame number is 0x00a4d and the offset is ending with dc0.
Question 1) Based on this information, can we say the physical address of corruption is 0x00a4d * PAGE_SIZE (4096) + dc0 = 0x00A4DDC0. Programmatically, what is best way to confirm the physical address? Ours is ppc64 based system.
Question 2) What would be the best way to find out this memory corruption? The more I read the more I get lost with the plethora of options. Should I use KASAN, or CONFIG_DEBUG_PAGEALLOC (debug_guardpage_minorder) option or a HW breakpoint?
Question 3) Since we know the corruption is at a fixed option, if we were to reserve/block that page, what again is the best option? The two I came across are memmap and Reserved memory regions
Thanks

1.) You are right about physical address.
2.) HW breakpoint is the best if you have such possibility. Do you have the appropriate device (t32 or whatever) / debug port/ could it place HW break at physical address?
Here is the more generic and dumb case which needs no HW support:
If I remember right from your previous post, you suspect the kernel code as a corruption causer.
If you have read anything about KASAN, you probably mentioned that gcc part places hooks instead of kernel code loads and stores. The kernel part provides kasan_store_bla_bla_bla hook, which handles correctness of this store. Very likely, that default functionality wouldn't help you, but you can integrate your code in this kasan store hook, which would:
2.1)Take the virtual address passed to the store kasan hook
2.2)Finds appropriate physical address by page tables walking like this (the more convenient API exists but i don't remember the function name):
pgd_t *pgd = pgd_offset(mm, addr);
pud_t *pud = pud_offset(pgd, addr);
pmd_t *pmd = pmd_offset(pud, addr);
...
As i remember from your previous post you get crash in userspace app, so you will be need to check all processes mms from task list.
2.3) Compare found physical address to the given, and check that written value is zero (as i remember from your previous post)
2.4) If match print backtraces for all cores and stop execution.

How are base registers, limit registers and relocation registers used?

My understanding in address translation process in MMU(memory management unit)
-> logical address : generated by cpu.programmer concern with this address.
-> virtual address : reside in the hard disk , as a pages.
-> physical address : reside in the RAM. It is the actual address.
1: cpu generate the logical address and send it to the MMU.
2: MMU translate the logical address into the virtual address then translate it to the physical address and send the physical address to RAM.
3: when ever the RAM is full , the page which is not used rapidly is returned to the hard disk , to allocate memory to the other pages(processes).
my questions are :
1) where the value of Relocation register is added?
2) who decide the value of Relocation Register?
3) what to do with the Base register and Limit register , how to use it?
4) where the logical address goes off?
If any body can answer it , It would be grateful to me.
It is requested that , let me know it any misunderstanding in this topic.
-thanks

I can tell you how this works on x86.
All programs in non-64-bit modes operate with addresses combined of two items: segment selector (for brevity "selector" is often omitted in text and that may be confusing) and offset. This selector:offset pair is called the logical address.
The selector portion isn't always explicitly specified or manipulated with in code since the CPU has "default" associations of segment registers containing selectors with specific instructions or specific instruction encodings. It's also uncommon to manipulate selectors in 32-bit mode, but is very often necessary in 16-bit code.
The virtual address is formed from the logical address either "directly" (in real or 8086 virtual mode) or "indirectly" (in protected mode).
"Direct" virtual address = selector * 16 + offset.
"Indirect" virtual address = SegmentDescriptorTable[selector].Base + offset.
SegmentDescriptorTable is either the Global Descriptor Table (AKA GDT) or the Local Descriptor Table (AKA LDT). It's set up by the OS and describes the location and size of various segments of memory. selector is used to select a segment in the table. The Base entry of the table tells the segment's beginning (virtual address). The Limit entry tells the segment size (generally; the details are a little more complex).
When a program tries to access memory with an offset resulting access beyond the end of the segment (the CPU compares offset and Limit), the CPU generates an exception and the OS handles it, by usually terminating the program.
Btw, in real/v86 mode, even though the virtual address is formed directly from selector:offset, there's still a 16-bit Limit imposed on offsets, which is why you need to use a different selector to access more than 64KB of memory.
The Base entry in a segment descriptor can be used to either isolate the segment from the rest of the memory (Limit helps here) or to place or move the entire segment to an arbitrary virtual address without having to modify anything (or much) in the program it belongs to (if we're moving a segment, the data has to be moved in the memory, obviously). Basically, it can be used for relocation purposes. In real/v86 mode for relocation purposes the selector is changed.
The virtual address can be further translated to the physical address if the CPU is running in protected mode and has set up page tables. If there're no page tables, the physical address is the same as the virtual address. The translation is done in blocks of physical memory and address ranges that are called pages (often 4KB).
There's no dedicated relocation register on x86 CPUs. Relocation can be achieved by adjusting:
segment selectors in CPU registers or program's code
segment base addresses in GDT/LDT
offsets in program's code
physical addresses in page tables
As for virtual address : reside in the hard disk , as a pages, I'm not sure what exactly you want to say with this, but just because there's virtual to physical address translation, it doesn't mean there's also virtual on-disk memory. There are other uses for the translation besides virtual on-disk memory. And the addresses reside in the CPU and wherever your (and OS's) code writes them to, not necessarily on the disk.

Your description has a number of mistakes, much of which may be the result of imprecise documentation and common usage.
First of all, there really is no such a thing as a virtual address. There are physical and logical addresses. Sadly, the term virtual address is frequently (even in hardware documentation) used when logical address is what is meant..
The CPU instruction stream always operates on logical addresses (values may refer to physical addresses).
When the CPU needs to access a logical address, the MMU attempts to translate it to a physical addresses. It does that by looking up the address in a page table.
Several things can happen at that point:
There may not be a page table entry for the address => Access violation.
The page table entry is marked invalid => Access violation.
The page table entry indicates that no physical memory is mapped to it => Page fault.
(I omit mode access checks).
It is this last step that last step where virtual memory comes into play. At that point the page fault handler of the operating system needs to find where the corresponding page has been stored to disk, load it, update the page table, and restart the instruction.
The operating system manages the available physical memory by paging writeable memory (that has changed) to disk (read only data does not have to be written back) when there is high demand for physical memory.
I have never heard of a "relocation register" before. But doing a GOOGLE search I can see that some academic material uses it as a confusing pedagogical concept (i.e., with no relation to reality).
Some systems define the page table using base and limit registers. The base registers indicate where the page table starts in memory (this can be either a physical or logical addresses) and the limit register indicates the side of the table.
The registers are usually not loaded directly. Their values are usually written to the hardware Process Context Block (PCB). When the process context is loaded, the page table base and limit are loaded automatically.
On some systems there are multiple page tables. If there are system and user page tables, the user page tables can refer to logical addresses in the system space and the system page tables refer to physical addresses.

Page translation of process code section in Linux. Why does the Page Table Entry get 0 for some pages?

For some reason, I need to translate the virtual address of the code section to physical address. I did following experiment:
I get the virtual address from the start_code and end_code in mm_struct of process A, which are the initial address and final address of the executable code.
I get the CR3 of process A.
I translate the virtual address to physical address page by page. For example, there are 10 pages for code section in process A. I will translate 10 virtual address of each beginning of the page.
I found out some pages will get Page Table Entry(PTE) == 0.Some pages could successfully translate to a physical address.
I tried Firefox and Minicom as my Process, and both of them will get into situation.
I guess my question is: could anyone explain to me why PTE == 0? Does it mean these pages have been swap out to disk? If this is the case, how can I find these pages?
Thanks for any input!!

It looks as if you are trying to perform page table introspection without using the kernel APIs for it. Note that the address space is arranged in a red-black tree of vm_area_struct structs and you should probably use the APIs that traverse them. The mappings might change at any time so using the proper locking for these data structures is necessary.
For example, see the get_user_pages() function. It can be used to swap-in and temporarily pin the pages into memory. Using this function for page table introspection is usually asked for because once have the physical address in hand then the kernel can swap out the page at any time.

How to get a struct page from any address in the Linux kernel

I have existing code that takes a list of struct page * and builds a descriptor table to share memory with a device. The upper layer of that code currently expects a buffer allocated with vmalloc or from user space, and uses vmalloc_to_page to obtain the corresponding struct page *.
Now the upper layer needs to cope with all kinds of memory, not just memory obtained through vmalloc. This could be a buffer obtained with kmalloc, a pointer inside the stack of a kernel thread, or other cases that I'm not aware of. The only guarantee I have is that the caller of this upper layer must ensure that the memory buffer in question is mapped in kernel space at that point (i.e. it is valid to access buffer[i] for all 0<=i<size at this point). How do I obtain a struct page* corresponding to an arbitrary pointer?
Putting it in pseudo-code, I have this:
lower_layer(struct page*);
upper_layer(void *buffer, size_t size) {
for (addr = buffer & PAGE_MASK; addr <= buffer + size; addr += PAGE_SIZE) {
struct page *pg = vmalloc_to_page(addr);
lower_layer(pg);
}
}
and I now need to change upper_layer to cope with any valid buffer (without changing lower_layer).
I've found virt_to_page, which Linux Device Drivers indicates operates on “a logical address, [not] memory from vmalloc or high memory”. Furthermore, is_vmalloc_addr tests whether an address comes from vmalloc, and virt_addr_valid tests if an address is a valid virtual address (fodder for virt_to_page; this includes kmalloc(GFP_KERNEL) and kernel stacks). What about other cases: global buffers, high memory (it'll come one day, though I can ignore it for now), possibly other kinds that I'm not aware of? So I could reformulate my question as:
What are all the kinds of memory zones in the kernel?
How do I tell them apart?
How do I obtain page mapping information for each of them?
If it matters, the code is running on ARM (with an MMU), and the kernel version is at least 2.6.26.

I guess what you want is a page table walk, something like (warning, not actual code, locking missing etc):
struct mm_struct *mm = current->mm;
pgd = pgd_offset(mm, address);
pmd = pmd_offset(pgd, address);
pte = *pte_offset_map(pmd, address);
page = pte_page(pte);
But you you should be very very careful with this. the kmalloc address you got might very well be not page aligned for example. This sounds like a very dangerous API to me.

Mapping Addresses to a struct page
There is a requirement for Linux to have a fast method of mapping virtual addresses to physical addresses and for mapping struct pages to their physical address. Linux achieves this by knowing where, in both virtual and physical memory, the global mem_map array is because the global array has pointers to all struct pages representing physical memory in the system. All architectures achieve this with very similar mechanisms, but, for illustration purposes, we will only examine the x86 carefully.
Mapping Physical to Virtual Kernel Addresses
any virtual address can be translated to the physical address by simply subtracting PAGE_OFFSET, which is essentially what the function virt_to_phys() with the macro __pa() does:
/* from <asm-i386/page.h> */
132 #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
/* from <asm-i386/io.h> */
76 static inline unsigned long virt_to_phys(volatile void * address)
77 {
78 return __pa(address);
79 }
Obviously, the reverse operation involves simply adding PAGE_OFFSET, which is carried out by the function phys_to_virt() with the macro __va(). Next we see how this helps the mapping of struct pages to physical addresses.
There is one exception where virt_to_phys() cannot be used to convert virtual addresses to physical ones. Specifically, on the PPC and ARM architectures, virt_to_phys() cannot be used to convert addresses that have been returned by the function consistent_alloc(). consistent_alloc() is used on PPC and ARM architectures to return memory from non-cached for use with DMA.
What are all the kinds of memory zones in the kernel? <---see here

For user-space allocated memory, you want to use get_user_pages, which will give you the list of pages associated with the malloc'd memory, and also increment their reference counter (you'll need to call page_cache_release on each page once done with them.)
For vmalloc'd pages, vmalloc_to_page is your friend, and I don't think you need to do anything.

For 64 bit architectures, the answer of gby should be adapted to:
pgd_t * pgd;
pmd_t * pmd;
pte_t * pte;
struct page *page = NULL;
pud_t * pud;
void * kernel_address;
pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
pte = pte_offset_map(pmd, address);
page = pte_page(*pte);
// mapping in kernel memory:
kernel_address = kmap(page);
// work with kernel_address....
kunmap(page);

You could try virt_to_page. I am not sure it is what you want, but at least it is somewhere to start looking.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio