Does mlock prevent the page from appearing in a core dump? - memory-management

I have a process with some sensitive memory which must never be written to disk.
I also have a requirement that I need core dumps to satisfy first-time data capture requirements of my client.
Does locking a page using mlock() prevent the page from appearing in a core dump?
Note, this is an embedded system and we don't actually have any swap space.

Taken from man 2 madvise:
The madvise() system call advises the kernel about how to handle
paging input/output in the address range beginning at address addr and
with size length bytes. It allows an application to tell the kernel
how it expects to use some mapped or shared memory areas, so that the
kernel can choose appropriate read-ahead and caching techniques.
This call does not influence the semantics of the application (except
in the case of MADV_DONTNEED), but may influence its performance. The
kernel is free to ignore the advice.
Particularly check the option MADV_DONTDUMP :
Exclude from a core dump those pages in the range specified by addr
and length. This is useful in applications that have large areas of
memory that are known not to be useful in a core dump. The effect of
MADV_DONTDUMP takes precedence over the bit mask that is set via the
/proc/PID/coredump_filter file (see core(5)).

Related

Mmap a writecombine region to userspace documentation

According to the following documentation
https://www.kernel.org/doc/html/latest/x86/pat.html,
Drivers wanting to export some pages to userspace do it by using mmap interface and a combination of:
pgprot_noncached()
io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
Note that this set of APIs only works with IO (non RAM) regions. If driver wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc() as step 0 above and also track the usage of those pages and use set_memory_wb() before the page is freed to free pool.
Why is the extra step set_memory_uc() or set_memory_wc() needed for RAM regions?
This is needed since set_memory_uc() and set_memory_wc() are specifically written to work with memory regions; the other API functions you're being told to use here are for I/O regions.
Since you want to work with page(s) in a RAM region using the API functions listed, your driver needs to mark them as uncached or write-combined first so that they can essentially be treated like I/O pages, use the APIs, and then be sure to follow up with explicit writeback(s) of the memory page(s) in order to sync their contents before your driver considers itself "finished" with them.

accessing process memory parts

I'm currently studying memory management of OS by the video lecture. The instructor says,
In fact, you may have, and it is quite often the case that there may
be several parts of the process memory, which are not even accessed at
all. That is, they are neither executed, loaded or stored from memory.
I don't understand the saying since even if in a simple C program, we access whole address space of it. Don't we?
#include <stdio.h>
int main()
{
printf("Hello, World!");
return 0;
}
Could you elucidate the saying? If possible could you provide an example program wherein "several parts of the process memory, which are not even accessed at all" when it is run.
Imagine you have a large and complicated utility (e.g. a compiler), and the user asks it for help (e.g. they type gcc --help instead of asking it to compile anything). In this case, how much of the utility's code and data is used?
Most programs have various optional parts that aren't used (e.g. maybe something that works with graphics will have some code for 16 bits per pixel and other code for 32 bits per pixel, and will determine which code to use and not use the other code). Most heap allocators are "eager" (e.g. they'll ask the OS for 20 MiB of space and then might only "malloc() 2 MiB of it). Sometimes a program will memory map a huge file but then only access a small part of it.
Even for your trivial "hello world" example code; the virtual address space probably contains a huge (several MiB) shared library to support lots of C standard library functions (e.g. puts(), fprintf(), sprintf(), ...) and your program only uses a small part of that shared library; and your program probably reserves a conservative amount of space for its stack (e.g. maybe 20 KiB of space for its stack) and then probably only uses a few hundred bytes of stack.
In a virtual memory system, the address space of the process is created in secondary store at start up. Little or nothing gets placed in memory. For example, the operating system may use the executable file as the page file for the code and static data. It just sets up an internal structure that says some range of memory is mapped to these blocks in the executable file. The same goes for shared libraries. The other data gets mapped to the page file.
As your program runs it starts page faulting rapidly because nothing is in memory and the operating system has to load it from secondary storage.
If there is something that your program does not reference, it never gets loaded into memory.
If you had global variable declared like
char somedata [1045] ;
and your program never references that variable, it will never get loaded into memory. The same goes for code. If you have pages of code that done get execute (e.g. error handling code) it does not get loaded. If you link to shared libraries, you will likely bece including a lot of functions that you never use. Likewise, they will not get loaded if you do not execute them.
To begin with, not all of the address space is backed by physical memory at all times, especially if your address space covers 248+ bytes, which your computer doesn't have (which is not to say you can't map most of the address space to a single physical page of memory, which would be of very little utility for anything).
And then some portions of the address space may be purposefully permanently inaccessible, like a few pages near virtual address 0 (to catch NULL pointer dereferences).
And as it's been pointed out in the other answers, with on-demand loading of programs, you may have some portions of the address space reserved for your program but if the program doesn't happen to need any of its code or data there, nothing needs to be loader there either.

x86 - kernel - programs, cleaning and memory overwrite

I am not sure about something. Take linux for example; when a program exits, the kernel is responsible for cleaning after the process.
How can one be sure that physical memory is never overwritten from process A to process B (different virtual memories (page entries) leading to the same physical allocation)?
How is it prevented?
Linux assigns pages to and frees pages from processes using the facilities described here.(Search the kernel sources for more detailed information.)
That means, the kernel saves information about the used pages in some data structure (could be a bitmap, for example) and only the unused ones are exposed as usable to new processes.
That prevents mistakenly assigning pages in use to new process. Any behavior beyond that would be a bug and a magnificent security hole.

Linux kernel ARM Translation table base (TTB0 and TTB1)

Compiled Linux kernel 2.6.34.3 for ARMv7 (Cortex-a8)
I looked into the kernel code and it looks like the Linux kernel sets the hardware page tables for the kernel address space (everything over 0xC0000000)on TTB1 (translation table base) and the user process on ttb0 (everything under 0xC0000000) which changes for every process context switch. Is this correct? I'm still confused how the MMU knows which ttb to look at for translations?
I read that the TTBCR (translation table base control register) determines which of the ttb register to walk when an MVA is not found, however the register always reads 0 which means always use TTBR0 in the ARM architecture reference manual. How is that possible? Can anyone explain to me how the Linux kernel uses these two ttbs?
I read how the ttb works from this site https://www.cs.rutgers.edu/~pxk/416/notes/10-paging.html but I still dont understand how the kernel use the two ttbs
(Double checked the kernel code, for some reason both ttb0 and ttb1 is set, but it seems like ttb1 is never used, i set the TTB1 register to 0 and the Linux kernel continue to run as usual)
The TTBR registers are used together to determine addressing for the full 32-bit or 40-bit address space. Which register is used for what address ranges is controlled via the tXsz bits in the TTBCR. There is an entry for t0sz corresponding to TTBR0 and t1sz for TTBR1.
The page tables addressed by each TTBRx register are independent, but you typically find most Linux implementations just use TTBR0. Linux expects to be able to use a 3G/1G address space partitioning scheme, which is not supported by ARM. If you look at page B3-1345 of the ARMv7 Architecture Reference Manual, you'll see that the value of t0sz and t1sz determine the address ranges supported by TTBR0 and TTBR1 respectively. To add confusion to disorientation, it is even possible to have disjoined address spaces where TTBR0 and TTBR1 support ranges that are not contiguous, resulting in a hole in the system address space. Good times!
To answer your main question though, it is recommended by ARM that TTBR0 be used to store the offset to the page tables used by USER processes, and TTBR1 be used to store the offset to the page tables used by the KERNEL. I have yet to see a single implementation that actually does this. Almost exclusively TTBR0 is used in all cases, with TTBR1 containing a duplicate copy of the L1 tables.
So how does this work? The value of TTBR is stored as part of the process state and simply restored each time a process with switched out. This is how it is expected to work. Originally, TTBR1 would hold a constant value for the kernel tables and never be replaced or swapped out, whereas TTBR0 would be changed each time you context switch between processes. Apparently most Linux implementations for ARM have decided to just basically eliminate the use of TTBR1 and stick to using TTBR0 for everything.
If you want to test this theory on your device, try whacking TTBR1 and watch nothing happen. Then try whacking TTBR0 and watch your system crash. I've yet to encounter a single instance that didn't result in this exact same result. Long story short, TTBR1 is useless by Linux, and TTBR0 is used almost exclusively and simply swapped out.
Now, once you get to LPAE support, throw all this away and start over again. This is the implementation where you will start to see the value of t0sz and t1sz being something other than zero, and hence N as well.
I have very little knowledge about ARM architecture, but from what I read in your enclosed link, then I guess Linux implements its virtual-memory management that way:
High-order bits of the virtual address determine which one to use. The base of the table is stored in one of two base registers (TTBR0 or TTBR1), depending on whether the topmost n bits of the virtual address are 0 (use TTBR0) or not (use TTBR1). The value for n is defined by the Translation Table Base Control Register (TTBCR).
The register TTBCR tells which addresses will be translated from page-tables pointed to by TTBR0 or TTBR1. If TTBCR contains 0xc000000, then any address from 0 to 0xbfffffff is translated by the page-table pointed by TTBR0, and any address from 0xc0000000 to 0xffffffff is translated by the page-table pointed by TTBR1. That match the Linux memory-split of 3GB for user process / 1GB for the kernel.
This allows one to have a design where the operating system and memory-mapped I/O are located in the upper part of the address space and managed by the page table in TTBR1 and user processes are in the lower part of memory and managed by the page table in TTB0. On a context switch, the operating system has to change TTBR0 to point to the first-level table for the new process. TTBR1 will still contain the memory map for the operating system and memory-mapped I/O.
Hence, the value of TTBR1 should never change because you want the kernel to be permanently mapped (think of what happens when an interrupt is raised). On the other hand, TTBR0 is modified at every process-switch, it contains the page-table of the current process.
See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211k/Bihgfcgf.html
For ARM5 and lower the TTB table is fixed in size and alignment (to 16k). Each level 1 entry represents 1MB. The table entry is 32bits (16k*1M/(32bit/8) = 4GB). The TTBCR controls TTBR0 table size. From the above URL,
Selecting which Translation Table Base Register is used
The Translation Table Base Register is selected as follows:
If N = 0, always use Translation Table Base Register 0.
- This is the default case at reset. It is backwards compatible with ARMv5 or earlier processors.
If N is greater than 0, then:
- if bits [31:32-N] of the Virtual Address are all 0, use Translation Table Base Register 0 otherwise use Translation Table Base Register 1.
So the size of TTBR0 also sets the memory split. For a traditional Linux 3G/1G 1G/3G, the value 2 should be selected. 4kB table == 1G memory == bits 31..30 are zero. For a value of 6 the table is 256byte == 64MB == bits 31..26 are zero.
In Linux parlance these are page global entries (and this splits this page global directory). The entries can point to another table or just be a 1MB segment. The next table entries are page middle Linux directories and then the final page table entries. I think the page middle entries are unused on the ARM.
The MMU hardware doesn't walk the tables every time. There is a TLB (translation look aside buffer). It is like a cache for the MMU tables. When the OS updates these tables, the TLB must be flushed or the processor will use stale entries. Similarly the ARM cache is virtual tagged, so changing the mapping may also mean the cache must be flushed. For these reasons, you never want to change things on a context switch. Shared libraries text (say libc.so) should be the same on a context switch. Hopefully each process has libc.so mapped at the same virtual address. There is a big gain in doing this; lower memory use and good I-cache use.
The domain and PID registers as well as supervisor/user modes can also control memory accesses. These are single registers that can be toggled on a context switch.
See http://lwn.net/images/conf/rtlws11/papers/proc/p01.pdf for info on PID and domain use on the ARMV5. The current Linux source doesn't do exactly like the paper describes. It is entirely possible that Linux doesn't need to use this mechanism and sets the TTBCR to zero so that the VM code for ARM sub-architectures is similar.
Edit: I don't believe the TTBCR functionality can be used to achieve a 3G/1G split. I think the Rutger's page was discussing the TTBCR generically and not in the Linux context. Also, at least the 2.6.38 Linux used domains or DACR but does not use the pid or fcse as it supports a limited number of processes.
http://lwn.net/Articles/106177/ - also referenced on the Rutgers page.
The TTBR0 holds the base address of translation table 0, and information about the memory it occupies.
This is one of the translation tables for the stage 1 translation of memory accesses from modes other than Hyp mode

Mapping of Page allocated to user process in Kernel virtual address space

When a page is created for a process (which will be mapped into process address space), will that page be mapped into kernel address space ?
If not, then it won't have kernel virtual address. Then how the swapper will find the page and swap that out, if a need arises ?
If we're talking about the x86 or similar (in terms of page translation) architectures, at any given time there's one virtual address space and normally one part of it is reserved for the kernel and the other for user-mode processes.
On a context switch between two processes only the user-mode part of the virtual address space changes.
With such an organization, the kernel always has full access to the current user-mode process, because, again, there's only one current virtual address space at any moment for both the kernel and a user-mode process, it's not two, it's one. So, the kernel doesn't really have to have another, extra mapping for user-mode pages. But that's not the main point.
The main point is that the kernel keeps some sort of statistics for every page that if needed can be saved to the disk and reused elsewhere. The CPU marks each page's page table entry (PTE) as accessed when the page is first read from or written to and as dirty when it's first written to.
The kernel scans the PTEs periodically, reads the accessed and dirty markers to update said statistics and clears accessed and dirty so it can detect a change in them later (of course, if any). Based on this statistics it determines which pages are rarely used or long unused and can be repurposed.
If the "swapper" runs in the context of the current process and if it runs in the kernel, then in theory it has enough information from the kernel (the list of rarely used or long unused pages to save and unmap if dirty or just unmap if not dirty) and sufficient access to the pages of interest.
If the "swapper" itself runs as a user-mode process, things become more complicated because it doesn't have access to another process' pages by default and has to either create a mapping or ask the kernel do some extra work for it in the context of the process of interest.
So, finding rarely used and long unused pages and their addresses occurs in the kernel. The CPU helps by automatically marking PTEs as accessed and dirty. There may need to be an extra mapping to dirty pages if they get saved to the disk not in the context of the process that owns them.

Resources