ARM Linux: PTE not writable but dirty

ARM Linux: PTE not writable but dirty - linux-kernel

I am aware that ARM architecture emulates the Linux's young and dirty flags by setting them in page fault handlers as discussed here. But recently for a small binary, I observed that a Linux PTE in one of the anonymous segments was set to be not writable and dirty. The following Linux PTE state was observed:
- L_PTE_PRESENT : 1
- L_PTE_YOUNG : 1
- L_PTE_DIRTY : 1
- L_PTE_RDONLY : 1
- L_PTE_XN : 0
I couldn't find an explanation for this combination of PTE flags. Does the kernel set this combination for special anonymous VMA segments? What does this combination signify? Any pointers will be helpful. Thanks in advance.

I observed that a Linux PTE in one of the anonymous segments was set to be not writable and dirty... What does this combination signify?
TL;DR - This simply means that the page is not in a backing store and it is read-only.
Dirty just means not written to a backing store (swap, mmap file or inode). Many things such as code are always read from a file, so they are backed by an inode.
If you mmap some read-only memory, then you could get this combination, for example. Other possibilities are a stack guard, allocator run-time buffer overflow detection, and copy-on-write functionality.
These are not normal. For a typical allocation, you will have something backed by swap and only a write will cause the page to become dirty. So the case is probably less frequent but valid.
See: ARM Linux PTE bits
ARM Linux emulate dirty/accessed
There seems to be little documentation on what the young bit means. young is information about what to swap. If something is young and not accessed for a prolonged time, it is a good candidate to evict. In contrast, dirty is for whether it needs to be swapped. If a page is dirty, then it has not been written to a backing store (a swap file or mmap file, etc). The pager must write out this page then. If it was not dirty (or clean), then the pager can simply discard the memory and re-use.
The difference between young and dirty is like should and must.
- L_PTE_PRESENT : 1 - it has physical RAM (not swapped)
- L_PTE_YOUNG : 1 - is has not been used
- L_PTE_DIRTY : 1 - it is different than backing store
- L_PTE_RDONLY : 1 - user space can not write.
- L_PTE_XN : 0 - code can execute.
Not present and dirty seem like an impossible condition for instance, but dirty and read-only is valid.

Related

x86 segmentation, DOS, MZ file format, and disassembling

I'm disassembling "Test Drive III". It's a 1990 DOS game. The *.EXE has MZ format.
I've never dealt with segmentation or DOS, so I would be grateful if you answered some of my questions.
1) The game's system requirements mention 286 CPU, which has protected mode. As far as I know, DOS was 90% real mode software, yet some applications could enter protected mode. Can I be sure that the app uses the CPU in real mode only? IOW, is it guaranteed that the segment registers contain actual offset of the segment instead of an index to segment descriptor?
2) Said system requirements mention 1 MB of RAM. How is this amount of RAM even meant to be accessed if the uppermost 384 KB of the address space are reserved for stuff like MMIO and ROM? I've heard about UMBs (using holes in UMA to access RAM) and about HMA, but it still doesn't allow to access the whole 1 MB of physical RAM. So, was precious RAM just wasted because its physical address happened to be reserved for UMA? Or maybe the game uses some crutches like LIM EMS or XMS?
3) Is CS incremented automatically when the code crosses segment boundaries? Say, the IP reaches 0xFFFF, and what then? Does CS switch to the next segment before next instruction is executed? Same goes for SS. What happens when SP goes all the way down to 0x0000?
4) The MZ header of the executable looks like this:
signature 23117 "0x5a4d"
bytes_in_last_block 117
blocks_in_file 270
num_relocs 0
header_paragraphs 32
min_extra_paragraphs 3349
max_extra_paragraphs 65535
ss 11422
sp 128
checksum 0
ip 16
cs 8385
reloc_table_offset 30
overlay_number 0
Why does it have no relocation information? How is it even meant to run without address fixups? Or is it built as completely position-independent code consisting from program-counter-relative instructions? The game comes with a cheat utility which is also an MZ executable. Despite being much smaller (8448 bytes - so small that it fits into a single segment), it still has relocation information:
offset 1
segment 0
offset 222
segment 0
offset 272
segment 0
This allows IDA to properly disassemble the cheat's code. But the game EXE has nothing, even though it clearly has lots of far pointers.
5) Is there even such thing as 'sections' in DOS? I mean, data section, code (text) section etc? The MZ header points to the stack section, but it has no information about data section. Is data and code completely mixed in DOS programs?
6) Why even having a stack section in EXE file at all? It has nothing but zeroes. Why wasting disk space instead of just saying, "start stack from here"? Like it is done with BSS section?
7) MZ header contains information about initial values of SS and CS. What about DS? What's its initial value?
8) What does an MZ executable have after the exe data? The cheat utility has whole 3507 bytes in the end of the executable file which look like
__exitclean.__exit.__restorezero._abort.DGROUP#.__MMODEL._main._access.
_atexit._close._exit._fclose._fflush._flushall._fopen._freopen._fdopen
._fseek._ftell._printf.__fputc._fputc._fputchar.__FPUTN.__setupio._setvbuf
._tell.__MKNAME._tmpnam._write.__xfclose.__xfflush.___brk.___sbrk._brk._sbrk
.__chmod.__close._ioctl.__IOERROR._isatty._lseek.__LONGTOA._itoa._ultoa.
_ltoa._memcpy._open.__open._strcat._unlink.__VPRINTER.__write._free._malloc
._realloc.__REALCVT.DATASEG#.__Int0Vector.__Int4Vector.__Int5Vector.
__Int6Vector.__C0argc.__C0argv.__C0environ.__envLng.__envseg.__envSize
Is this some kind of debugging symbol information?
Thank you in advance for your help.

Re. 1. No, you can't be sure until you prove otherwise to yourself. One giveaway would be the presence of MOV CR0, ... in the code.
Re. 2. While marketing materials aren't to be confused with an engineering specification, there's a technical reason for this. A 286 CPU could address more than 1M of physical address space. The RAM was only "wasted" in real mode, and only if an EMM (or EMS) driver wasn't used. On 286 systems, the RAM past 640kb was usually "pushed up" to start at the 1088kb mark. The ISA and on-board peripherals' memory address space was mapped 1:1 into the 640-1024kb window. To use the RAM from the real mode needed an EMM or EMS driver. From protected mode, it was simply "there" as soon as you set up the segment descriptor correctly.
If the game actually needed the extra 384kb of RAM over the 640kb available in the real mode, it's a strong indication that it either switched to protected mode or required the services or an EMM or EMS driver.
Re. 3. I wish I remembered that. On reflection, I wish not :) Someone else please edit or answer separately. Hah, I did know it at some point in time :)
Re. 4. You say "[the code] has lots of instructions like call far ptr 18DCh:78Ch". This implies one of three things:
Protected mode is used and the segment part of the address is a selector into the segment descriptor table.
There is code there that relocates those instructions without DOS having to do it.
There is code there that forcibly relocates the game to a constant position in the address space. If the game doesn't use DOS to access on-disk files, it can remove DOS completely and take over, gaining lots of memory in the process. I don't recall whether you could exit from the game back to the command prompt. Some games where "play until you reboot".
Re. 5. The .EXE header does not "point" to any stack, there is no stack section you imply, the concept of sections doesn't exist as far as the .EXE file is concerned. The SS register value is obtained by adding the segment the executable was loaded at with the SS value from the header.
It's true that the linker can arrange sections contiguously in the .EXE file, but such sections' properties are not included in the .EXE header. They often can be reverse-engineered by inspecting the executable.
Re. 6. The SS and SP values in the .EXE header are not file pointers. The EXE file might have a part that maps to the stack, but that's entirely optional.
Re. 7. This is already asked and answered here.
Re. 8. This looks like a debug symbol list. The cheat utility was linked with the debugging information left in. You can have completely arbitrary data there - often it'd various resources (graphics, music, etc.).

Does mlock prevent the page from appearing in a core dump?

I have a process with some sensitive memory which must never be written to disk.
I also have a requirement that I need core dumps to satisfy first-time data capture requirements of my client.
Does locking a page using mlock() prevent the page from appearing in a core dump?
Note, this is an embedded system and we don't actually have any swap space.

Taken from man 2 madvise:
The madvise() system call advises the kernel about how to handle
paging input/output in the address range beginning at address addr and
with size length bytes. It allows an application to tell the kernel
how it expects to use some mapped or shared memory areas, so that the
kernel can choose appropriate read-ahead and caching techniques.
This call does not influence the semantics of the application (except
in the case of MADV_DONTNEED), but may influence its performance. The
kernel is free to ignore the advice.
Particularly check the option MADV_DONTDUMP :
Exclude from a core dump those pages in the range specified by addr
and length. This is useful in applications that have large areas of
memory that are known not to be useful in a core dump. The effect of
MADV_DONTDUMP takes precedence over the bit mask that is set via the
/proc/PID/coredump_filter file (see core(5)).

Cleared RW (write protect) flag for PTEs of a process in kernel yet no segmentation fault on write

I implemented incremental process checkpointing at page level(I just dump the data from the process address space into a file).
The approach I used is as follows. I used two system calls:
Complete Checkpoint: copy entire address space. Also if write bit
is set for a page, clear it.
Incremental checkpoint: only dump data if write bit is set and clear it again. So basically, I check if write bit is set for an incremental checkpoint. If yes, dump the page data.
Test program:
char a[10000];
sys_cp_range(a,a+10000);
a[3]='A';
sys_incr_cp_range(a,a+10000);
From what I know, the kernel should be doing page fault and handle illegal write case by killing the process with SIGSEGV. Yet the program is successfully checkpointed.
What is exactly happening here ?

If you modify a PTE when it's still cached in the TLB, the effect of the modification may be unseen for a while (until the PTE gets evicted from the TLB and has to be reread from the page table).
You need to invalidate the PTE in the TLB with the invlpg (I'm assuming x86) instruction after PTE modification. And it has to be done on all CPUs. There must be a dedicated function for this purpose in the kernel.
Also it wouldn't hurt to double check that the compiler didn't reorder or throw away anything from the above code.

Linux kernel ARM Translation table base (TTB0 and TTB1)

Compiled Linux kernel 2.6.34.3 for ARMv7 (Cortex-a8)
I looked into the kernel code and it looks like the Linux kernel sets the hardware page tables for the kernel address space (everything over 0xC0000000)on TTB1 (translation table base) and the user process on ttb0 (everything under 0xC0000000) which changes for every process context switch. Is this correct? I'm still confused how the MMU knows which ttb to look at for translations?
I read that the TTBCR (translation table base control register) determines which of the ttb register to walk when an MVA is not found, however the register always reads 0 which means always use TTBR0 in the ARM architecture reference manual. How is that possible? Can anyone explain to me how the Linux kernel uses these two ttbs?
I read how the ttb works from this site https://www.cs.rutgers.edu/~pxk/416/notes/10-paging.html but I still dont understand how the kernel use the two ttbs
(Double checked the kernel code, for some reason both ttb0 and ttb1 is set, but it seems like ttb1 is never used, i set the TTB1 register to 0 and the Linux kernel continue to run as usual)

The TTBR registers are used together to determine addressing for the full 32-bit or 40-bit address space. Which register is used for what address ranges is controlled via the tXsz bits in the TTBCR. There is an entry for t0sz corresponding to TTBR0 and t1sz for TTBR1.
The page tables addressed by each TTBRx register are independent, but you typically find most Linux implementations just use TTBR0. Linux expects to be able to use a 3G/1G address space partitioning scheme, which is not supported by ARM. If you look at page B3-1345 of the ARMv7 Architecture Reference Manual, you'll see that the value of t0sz and t1sz determine the address ranges supported by TTBR0 and TTBR1 respectively. To add confusion to disorientation, it is even possible to have disjoined address spaces where TTBR0 and TTBR1 support ranges that are not contiguous, resulting in a hole in the system address space. Good times!
To answer your main question though, it is recommended by ARM that TTBR0 be used to store the offset to the page tables used by USER processes, and TTBR1 be used to store the offset to the page tables used by the KERNEL. I have yet to see a single implementation that actually does this. Almost exclusively TTBR0 is used in all cases, with TTBR1 containing a duplicate copy of the L1 tables.
So how does this work? The value of TTBR is stored as part of the process state and simply restored each time a process with switched out. This is how it is expected to work. Originally, TTBR1 would hold a constant value for the kernel tables and never be replaced or swapped out, whereas TTBR0 would be changed each time you context switch between processes. Apparently most Linux implementations for ARM have decided to just basically eliminate the use of TTBR1 and stick to using TTBR0 for everything.
If you want to test this theory on your device, try whacking TTBR1 and watch nothing happen. Then try whacking TTBR0 and watch your system crash. I've yet to encounter a single instance that didn't result in this exact same result. Long story short, TTBR1 is useless by Linux, and TTBR0 is used almost exclusively and simply swapped out.
Now, once you get to LPAE support, throw all this away and start over again. This is the implementation where you will start to see the value of t0sz and t1sz being something other than zero, and hence N as well.

I have very little knowledge about ARM architecture, but from what I read in your enclosed link, then I guess Linux implements its virtual-memory management that way:
High-order bits of the virtual address determine which one to use. The base of the table is stored in one of two base registers (TTBR0 or TTBR1), depending on whether the topmost n bits of the virtual address are 0 (use TTBR0) or not (use TTBR1). The value for n is defined by the Translation Table Base Control Register (TTBCR).
The register TTBCR tells which addresses will be translated from page-tables pointed to by TTBR0 or TTBR1. If TTBCR contains 0xc000000, then any address from 0 to 0xbfffffff is translated by the page-table pointed by TTBR0, and any address from 0xc0000000 to 0xffffffff is translated by the page-table pointed by TTBR1. That match the Linux memory-split of 3GB for user process / 1GB for the kernel.
This allows one to have a design where the operating system and memory-mapped I/O are located in the upper part of the address space and managed by the page table in TTBR1 and user processes are in the lower part of memory and managed by the page table in TTB0. On a context switch, the operating system has to change TTBR0 to point to the first-level table for the new process. TTBR1 will still contain the memory map for the operating system and memory-mapped I/O.
Hence, the value of TTBR1 should never change because you want the kernel to be permanently mapped (think of what happens when an interrupt is raised). On the other hand, TTBR0 is modified at every process-switch, it contains the page-table of the current process.

See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211k/Bihgfcgf.html
For ARM5 and lower the TTB table is fixed in size and alignment (to 16k). Each level 1 entry represents 1MB. The table entry is 32bits (16k*1M/(32bit/8) = 4GB). The TTBCR controls TTBR0 table size. From the above URL,
Selecting which Translation Table Base Register is used
The Translation Table Base Register is selected as follows:
If N = 0, always use Translation Table Base Register 0.
- This is the default case at reset. It is backwards compatible with ARMv5 or earlier processors.
If N is greater than 0, then:
- if bits [31:32-N] of the Virtual Address are all 0, use Translation Table Base Register 0 otherwise use Translation Table Base Register 1.
So the size of TTBR0 also sets the memory split. For a traditional Linux 3G/1G 1G/3G, the value 2 should be selected. 4kB table == 1G memory == bits 31..30 are zero. For a value of 6 the table is 256byte == 64MB == bits 31..26 are zero.
In Linux parlance these are page global entries (and this splits this page global directory). The entries can point to another table or just be a 1MB segment. The next table entries are page middle Linux directories and then the final page table entries. I think the page middle entries are unused on the ARM.
The MMU hardware doesn't walk the tables every time. There is a TLB (translation look aside buffer). It is like a cache for the MMU tables. When the OS updates these tables, the TLB must be flushed or the processor will use stale entries. Similarly the ARM cache is virtual tagged, so changing the mapping may also mean the cache must be flushed. For these reasons, you never want to change things on a context switch. Shared libraries text (say libc.so) should be the same on a context switch. Hopefully each process has libc.so mapped at the same virtual address. There is a big gain in doing this; lower memory use and good I-cache use.
The domain and PID registers as well as supervisor/user modes can also control memory accesses. These are single registers that can be toggled on a context switch.
See http://lwn.net/images/conf/rtlws11/papers/proc/p01.pdf for info on PID and domain use on the ARMV5. The current Linux source doesn't do exactly like the paper describes. It is entirely possible that Linux doesn't need to use this mechanism and sets the TTBCR to zero so that the VM code for ARM sub-architectures is similar.
Edit: I don't believe the TTBCR functionality can be used to achieve a 3G/1G split. I think the Rutger's page was discussing the TTBCR generically and not in the Linux context. Also, at least the 2.6.38 Linux used domains or DACR but does not use the pid or fcse as it supports a limited number of processes.
http://lwn.net/Articles/106177/ - also referenced on the Rutgers page.

The TTBR0 holds the base address of translation table 0, and information about the memory it occupies.
This is one of the translation tables for the stage 1 translation of memory accesses from modes other than Hyp mode

redis bgsave failed because fork Cannot allocate memory

all:
here is my server memory info with 'free -m'
total used free shared buffers cached
Mem: 64433 49259 15174 0 3 31
-/+ buffers/cache: 49224 15209
Swap: 8197 184 8012
my redis-server has used 46G memory, there is almost 15G memory left free
As my knowledge,fork is copy on write, it should not failed when there has 15G free memory,which is enough to malloc necessary kernel structures .
besides, when redis-server used 42G memory, bgsave is ok and fork is ok too.
Is there any vm parameter I can tune to make fork return success ?

More specifically, from the Redis FAQ
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 says Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
Redis doesn't need as much memory as the OS thinks it does to write to disk, so may pre-emptively fail the fork.

Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then restart sysctl with:
On FreeBSD:
sudo /etc/rc.d/sysctl reload
On Linux:
sudo sysctl -p /etc/sysctl.conf

From proc(5) man pages:
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4
any non-zero value implies mode 1. In mode 2 (available since Linux 2.6), the total virtual address space on the system is limited to (SS + RAM*(r/100)), where SS is the size
of the swap space, and RAM is the size of the physical memory, and r is the contents of the file /proc/sys/vm/overcommit_ratio.

Redis's fork-based snapshotting method can effectively double physical memory usage and easily OOM in cases like yours. Reliance on linux virtual memory for doing snapshotting is problematic, because Linux has no visibility into Redis data structures.
Recently a new redis-compatible project Dragonfly has been released. Among other things, it solves the OOM problem entirely. (disclosure - I am the author of this project).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio