Duplicated kernel page table entries - linux-kernel

I wrote an application that outputs only the kernel page directory entries and page table entries, starting from the CR3 of the current process.
In the logs, I can see distinct PDEs with different PTEs (which is normal). However, I can also see that some PTEs (in the same PDE) all point to the same page table, and this is repeated many times.
example:
[ 5437.670588] page_table: 4890000 global: 0 dirty: 1 u/s: 0 read/write: 1 present: 1
[ 5437.670815] page_table: 4890000 global: 1 dirty: 0 u/s: 0 read/write: 1 present: 1
Here we can see that they point to the same page table, but only the global flag is different.
Does anyone have an explanation for this? Thanks in advance :)
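For reference, the flag fields in those log lines correspond to architectural bits of an x86-64 entry: present is bit 0, read/write bit 1, user/supervisor bit 2, dirty bit 6 and global bit 8, with the frame address in bits 12..51. Below is a minimal user-space sketch (my own illustration, not the asker's module) that decodes a raw entry and prints it in the same layout; the two raw values are hypothetical ones chosen to reproduce the two lines above.

#include <stdio.h>
#include <stdint.h>

/* Decode and print one raw x86-64 page-table entry in the log's layout. */
static void dump_pte(uint64_t pte)
{
    uint64_t frame = pte & 0x000FFFFFFFFFF000ULL;   /* bits 12..51: frame address */
    printf("page_table: %llx global: %llu dirty: %llu u/s: %llu "
           "read/write: %llu present: %llu\n",
           (unsigned long long)frame,
           (unsigned long long)((pte >> 8) & 1),    /* global          */
           (unsigned long long)((pte >> 6) & 1),    /* dirty           */
           (unsigned long long)((pte >> 2) & 1),    /* user/supervisor */
           (unsigned long long)((pte >> 1) & 1),    /* read/write      */
           (unsigned long long)(pte & 1));          /* present         */
}

int main(void)
{
    /* Hypothetical raw entries: same frame (0x4890000), different flags. */
    dump_pte(0x0000000004890043ULL);   /* dirty, not global */
    dump_pte(0x0000000004890103ULL);   /* global, not dirty */
    return 0;
}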

Related

Understanding Page Tables in Linux/seL4

Why are entries in the Page Global Directory offset? What is the significance of the offset, if any?
Page Global Directory
Address              Entry 1              Entry 2
0000000080036000:    0x0000000000000000   0x0000000000000000
...
0000000080036bf0:    0x0000000000000000   0x0000000000000000
0000000080036c00:    0x000000002000d401   0x0000000000000000
0000000080036c10:    0x0000000000000000   0x0000000000000000
...
0000000080036ff0:    0x0000000000000000   0x0000000000000000
Why do the entries not start at position 0, i.e. at 0x80036000?
Linux and seL4 for RISC-V (sv39) use a three-level page table that has a Page Global Directory (PGD), Page Middle Directory (PMD), and Page Table Entries (PTE). The PTE entries point to executable data. It's exactly this: http://www.science.unitn.it/~fiorella/guidelinux/tlk/node34.html
Each table is 4096 bytes or 0x1000. The entries are 8 bytes (64 bits) each. Each table can have 512 entries (0x200). Some of the tables--particularly the PGD and PMD--have entries that are offset. In other words, instead of starting at position 0 in the table, the entries are 0 until halfway or even 3/4 of the way through the table. I'm trying to understand why that is.
The question is about the location within the table rather than the contents of the location. That is, why start at 0x80036c00, as opposed to what 0x2000d401 means? I know that 0x2000d401 points to an entry in the PMD, which points to an entry in the PTE, and that finally points to executable code.
I can walk the page table conceptually without a problem. My problem is that I've moved the payload in my binary and modified the page tables to use 4 KB instead of 2 MB pages. This works in a special case, but not in the general case, and I'm trying to understand why. I suspect I've got something wrong in the page tables based on the QEMU outputs I get.
Why are entries in the Page Global Directory offset? What is the significance of the offset, if any?
The entries in the PGD are not offsets; they are pointers to the next-level page table. The index (offset) into each table comes from the virtual address itself.
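To make that concrete for sv39: the top-level index is simply bits 30..38 of the virtual address (VPN[2]), and since each entry is 8 bytes, the byte offset of an entry inside the 4 KiB PGD is that index times 8. The short sketch below is my own illustration; the example address is a hypothetical kernel-half address chosen so that it lands at offset 0xc00, i.e. 0x80036000 + 0xc00 = 0x80036c00, the first non-zero row in the dump above.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical sv39 kernel-half virtual address (bits 39..63 are a
     * sign extension of bit 38). */
    uint64_t va = 0xffffffe000000000ULL;

    unsigned vpn2     = (va >> 30) & 0x1FF;   /* sv39 VPN[2]: bits 30..38      */
    unsigned byte_off = vpn2 * 8;             /* 8-byte entries in a 4 KiB PGD */

    printf("VPN[2] = %u -> PGD entry at table base + 0x%x\n", vpn2, byte_off);
    return 0;
}

Any address whose VPN[2] is smaller lands earlier in the table, which is why user-space mappings and kernel mappings show up in different parts of the PGD.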
Linux and seL4 use a three level page table that has a Page Global Directory (PGD), Page Middle Directory (PMD), and Page Table Entries (PTE).
To my knowledge, this is outdated information. I don't know anything about seL4, but Linux uses 4-level page tables. This probably also depends on the architecture (x86-64 vs. ARM, for example). Even then, I'm quite sure Linux uses 4 levels regardless of the architecture and simply folds away some levels based on the architecture's requirements.
Each table is 4096 bytes or 0x1000. The entries are 8 bytes (64 bits) each. Each table can have 512 entries (0x200). Some of the tables--particularly the PGD and PMD--have entries that are offset. In other words, instead of starting at position 0 in the table, the entries are 0 until halfway or even 3/4 of the way through the table. I'm trying to understand why that is.
This paragraph gives me a hint (maybe I'm wrong) that your question is about the x86-64 architecture. On x86-64 there are 4 levels of page tables. Paging is partially implemented in hardware, in that the processor's MMU automatically walks the page tables to translate virtual addresses to physical ones. It is also partially implemented in software (the operating system), in that the OS fills in the page tables and so on.
There are several reasons why some portions of the page tables could be zero. Most likely, the portion of memory represented by those entries is simply not in the address space of the process. If a user-mode process accesses such an address, it triggers a page fault and the kernel will normally kill the process.
Also, if it is a kernel page table, it could simply be that a portion of the virtual address space is currently unused by the kernel, so the corresponding entries are left as zeroes.
It is always a good idea to fill the page tables with zeroes before "giving" them to a user-mode process, because otherwise an access could reach the address space of another process, which would be a security threat. For example, if an entry in a user-mode process's page tables wasn't zeroed out, it could still have the present bit set (so no page fault) and could translate to the address space of another process. That would be a vulnerability.
Don't forget that page tables are also a memory-protection mechanism: they isolate one process from another, because a process's page tables should not translate to physical memory belonging to another process. They also have the user vs. supervisor bit, which isolates user mode from the kernel.
As to paging, the 4-level paging scheme on x86-64 has the PML4, the page-directory-pointer table (PDPT), the page directory (PD) and the page table (PT). The virtual address looks like the following (in binary):
Index in PML4 (9 bits) | Index in PDPT (9 bits) | Index in PD (9 bits) | Index in PT (9 bits) | Offset in physical frame (12 bits)
0b 000000000             000000000                000000000              000000000              000000000000
Here I represented only 48 bits because the upper 16 bits are not used for translation (unless 5-level paging is enabled, which is present on newer processors). Basically, each 9-bit field is an index into the corresponding page table. The 12 least significant bits are the offset into the physical frame. Linux simply gave generic names to its page tables because it supports several architectures; the names on Linux are PGD, PUD (Page Upper Directory), PMD and PTE.
Each entry at one level points to the page table used for the next step of the translation, and the 9-bit index taken from the virtual address selects an entry within that table.
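As a small illustration of that breakdown (my own sketch; the field names follow the hardware names above and the example address is arbitrary), this is how the four 9-bit indices and the 12-bit offset are carved out of a 48-bit virtual address:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t vaddr = 0x00007f1234567000ULL;       /* arbitrary example address */

    unsigned pml4_index = (vaddr >> 39) & 0x1FF;  /* bits 39..47 */
    unsigned pdpt_index = (vaddr >> 30) & 0x1FF;  /* bits 30..38 */
    unsigned pd_index   = (vaddr >> 21) & 0x1FF;  /* bits 21..29 */
    unsigned pt_index   = (vaddr >> 12) & 0x1FF;  /* bits 12..20 */
    unsigned offset     =  vaddr        & 0xFFF;  /* bits  0..11 */

    printf("PML4: %u  PDPT: %u  PD: %u  PT: %u  offset: 0x%x\n",
           pml4_index, pdpt_index, pd_index, pt_index, offset);
    return 0;
}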
Hope this clears any misconceptions! Feel free to comment for anything.

How does xv6 know where the last element of p->pgdir is?

In xv6 each process has a struct called proc which stores the process's state.
It has a field called pgdir which points to its page directory.
My question is this: it should store the index of the last element in its page directory. I mean, if it wants to allocate a new page table, it has to put a reference to it in pgdir.
My question is: how does it know where the next element of the page directory is?
This image explains my question further:
Thanks for your help.
I asked some people in the real world and understood that p->sz determines the index of the first free element in the process's page directory.
The process always keeps track of how many bytes its program uses via the sz field in the proc struct. With this knowledge, it's easy to calculate the last page directory entry and page table entry used.
For example, if a program uses 8000 bytes at the moment (meaning sz = 8000):
In xv6, each memory page is 4096 bytes (PGSIZE) and there are 1024 page table entries under each page directory entry. Therefore, each page directory entry covers 4096 * 1024 bytes (4 MB) and each page table entry covers 4096 bytes (4 KB).
That means the process's last page directory entry is: sz / 4 MB (rounded down),
and the last page table entry within that page directory entry is: sz / 4 KB (rounded down), taken modulo 1024 to get the index within the directory. In the sz = 8000 example, that means page directory entry 0 (first) and page table entry 1 (second).
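Here is a minimal sketch of that arithmetic (my own illustration; PGSIZE and the 1024-entries-per-table constant are the values xv6 uses, and the result is effectively what xv6's PDX() and PTX() macros in mmu.h compute for the address sz):

#include <stdio.h>

#define PGSIZE     4096   /* bytes per page, as in xv6                   */
#define NPTENTRIES 1024   /* page-table entries per page directory entry */

int main(void)
{
    unsigned int sz = 8000;                              /* p->sz from the example  */

    unsigned int pd_index = sz / (PGSIZE * NPTENTRIES);  /* sz / 4 MB, rounded down */
    unsigned int pt_index = (sz / PGSIZE) % NPTENTRIES;  /* index within that PDE   */

    printf("sz = %u -> page directory entry %u, page table entry %u\n",
           sz, pd_index, pt_index);
    return 0;
}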

How bad is repeated select * on an empty table in h2 database (v1.4.195)?

So I have an H2 database event table that I am monitoring for events. There is a thread that fires every 2 seconds and checks it with select * from eventTable limit 10 offset 0.
I was wondering what the performance impact of this hammering is on an H2 database table. It is B-tree based, but the database itself is a file. Does H2 have to go to the file and read blocks just to determine whether the table is empty? Think of the Oracle DB high-water-mark problem, where querying a table whose rows were deleted later without a truncate causes unnecessary block reads just to run a select *, which is bad for performance.
If this is bad at all, would it be recommended to swap out the thread part for the trigger approach on insert operations described in this question here?
Regards
Here are some numbers from I/O monitoring:
I started my application that has the monitoring thread with the select statement. Then I started "sudo fs_usage MYPID". The following pattern of 4 reads and 2 writes repeats itself:
16:18:10.809774 pread F=44 B=0x400 O=0x00037000 0.000020 java.11010
16:18:10.809809 pread F=44 B=0x400 O=0x00037000 0.000005 java.11010
16:18:10.809825 pread F=44 B=0x400 O=0x00037000 0.000003 java.11010
16:18:10.809839 pread F=44 B=0x400 O=0x00037000 0.000004 java.11010
16:18:10.810044 pwrite F=44 B=0x1000 O=0x00031000 0.000034 java.11010
16:18:10.810087 pwrite F=44 B=0x2000 O=0x00000000 0.000010 java.11010
F= is the file descriptor, confirmed via lsof -p PID to be the database file. B= is the number of bytes read or written. O= is the offset in the file.
I see the above pattern of reads and writes constantly. The reads are then most likely the selects. There is no other activity on the DB. So even for an empty table, I am consistently reading something like 1600 bytes and making 2 writes in the range of 3000 to 4000 bytes.
But then I went into more detail. Since I am on macOS, strace is not an option, but dtruss works nicely. So I just ran dtruss -a -p PID, and the following is the relevant output for the reads and writes:
632/0x653a: 2229289 37 24 pread(0x2C, "chunk:3157,block:31,len:2,map:9,max:1540,next:4d,pages:6,root:c55c0000027cf,time:18ffdc4,version:3157 \n\0", 0x400, 0x31000) = 1024 0
632/0x652c: 773689 86 2 gettimeofday(0x70000B107C68, 0x0, 0x0) = 0 0
632/0x653a: 2229327 13 5 pread(0x2C, "chunk:3157,block:31,len:2,map:9,max:1540,next:4d,pages:6,root:c55c0000027cf,time:18ffdc4,version:3157 \n\0", 0x400, 0x31000) = 1024 0
632/0x653a: 2229347 10 4 pread(0x2C, "chunk:3157,block:31,len:2,map:9,max:1540,next:4d,pages:6,root:c55c0000027cf,time:18ffdc4,version:3157 \n\0", 0x400, 0x31000) = 1024 0
632/0x653a: 2229373 11 4 pread(0x2C, "chunk:3157,block:31,len:2,map:9,max:1540,next:4d,pages:6,root:c55c0000027cf,time:18ffdc4,version:3157 \n\0", 0x400, 0x31000) = 1024 0
632/0x653a: 2229621 45 34 pwrite(0x2C, "chunk:3159,block:24,len:1,map:9,max:b80,next:35,pages:4,root:c5640000027cf,time:19001ef,version:3159 \n\0", 0x1000, 0x24000) = 4096 0
632/0x653a: 2229686 32 24 pwrite(0x2C, "H:2,block:24,blockSize:1000,chunk:3159,created:1610362b746,format:1,version:3159,fletcher:1d05a51e\n\0", 0x2000, 0x0) = 8192 0
Adding up the return values of pread and pwrite above, I can see the actual reads are 1024 x 4 bytes and the writes are 4096 + 8192 bytes. One can also see what is read and written. The last write sometimes appears and sometimes doesn't. The first parameter to pread and pwrite is the file descriptor 0x2c, which matches that of the database file, and the second parameter is the buffer being read or written. I wondered why we need to write anything here at all, but that got explained when I read the architecture explanation on the H2 project page:
The above writes and reads can be explained by h2database.com/html/mvstore.html#fileFormat
Browsing the source code, I find that the BackgroundWriterThread class, which I also noticed in the profiler churning through bytes as time goes by (but no memory leaks, it cleans up properly), wakes up every second and just blindly commits the store. That pins down where in the code the reads and writes above come from.
More googling revealed the problem was discussed in the Google group here, though no resolution occurred, except that someone later posted that the WRITE_DELAY parameter does the trick for him: groups.google.com/forum/#!searchin/h2-database/… That made me wonder whether he had tried setting autoCommit to false on the connection. I tried it, and the above pattern of reads and writes stopped for me.
So adding ;AUTOCOMMIT=OFF to the connection parameters does the trick, and the query is served from memory, so the overhead of select * on an empty table is quite minimal. This ends the investigation for me. The data is in memory, as I am using version 1.4.195, which has the MVStore engine as the default. So querying an empty table in memory should be a relatively inexpensive operation.
Regards

Coredump size different than process virtual memory space

I'm working on OS X 10.11 and generated a dump file in the following manner:
1. ulimit -c unlimited
2. kill -10 5228 (process pid)
and got a dump file with the following attributes: 642M Jun 26 15:00 core.5228
Right before that, I checked the process's total memory space using the vmmap command to try to estimate the expected dump size.
However, the estimate (238.7 MB) was much smaller than the actual size (642 MB).
Can this gap be explained?
REGION TYPE                 VIRTUAL SIZE   COUNT (non-coalesced)
===========                 ============   =====
Activity Tracing 2048K 2
Kernel Alloc Once 4K 2
MALLOC guard page 16K 4
MALLOC metadata 180K 6
MALLOC_SMALL 56.0M 4 see MALLOC ZONE table below
MALLOC_SMALL (empty) 8192K 2 see MALLOC ZONE table below
MALLOC_TINY 8192K 3 see MALLOC ZONE table below
STACK GUARD 56.0M 2
Stack 8192K 2
__DATA 1512K 44
__LINKEDIT 90.9M 4
__TEXT 8336K 44
shared memory 12K 4
=========== ======= =======
TOTAL 238.7M 110
MALLOC ZONE                      VIRTUAL SIZE   ALLOCATION COUNT   BYTES ALLOCATED   % FULL   REGION COUNT
===========                      ============   ================   ===============   ======   ============
DefaultMallocZone_0x100e42000 72.0M 7096 427K 0% 6
A core dump can, and does, filter the process memory. See the core(5) man page:
Controlling which mappings are written to the core dump
Since kernel 2.6.23, the Linux-specific /proc/PID/coredump_filter file can be used to control which memory segments are written to the core dump file in the event that a core dump is performed for the process with the corresponding process ID.
The value in the file is a bit mask of memory mapping types (see mmap(2)). If a bit is set in the mask, then memory mappings of the corresponding type are dumped; otherwise they are not dumped. The bits in this file have the following meanings:
bit 0 Dump anonymous private mappings.
bit 1 Dump anonymous shared mappings.
bit 2 Dump file-backed private mappings.
bit 3 Dump file-backed shared mappings.
bit 4 (since Linux 2.6.24)
Dump ELF headers.
bit 5 (since Linux 2.6.28)
Dump private huge pages.
bit 6 (since Linux 2.6.28)
Dump shared huge pages.
bit 7 (since Linux 4.4)
Dump private DAX pages.
bit 8 (since Linux 4.4)
Dump shared DAX pages.
By default, the following bits are set: 0, 1, 4 (if the CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS kernel configuration option is enabled), and 5. This default can be modified at boot time using the coredump_filter boot option.
I assume OS X behaves similarly.
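On Linux you can inspect that mask directly. The sketch below is my own (Linux-only, so it will not run on OS X): it reads /proc/self/coredump_filter, which holds the mask as a hexadecimal value, and decodes the bits using the names from the man page quoted above.

#include <stdio.h>

int main(void)
{
    /* The file holds a single hexadecimal bit mask, e.g. "00000033". */
    FILE *f = fopen("/proc/self/coredump_filter", "r");
    if (!f) {
        perror("fopen");   /* e.g. not on Linux, or /proc not mounted */
        return 1;
    }
    unsigned long mask;
    if (fscanf(f, "%lx", &mask) != 1) {
        fclose(f);
        return 1;
    }
    fclose(f);

    static const char *bits[] = {
        "anonymous private", "anonymous shared",
        "file-backed private", "file-backed shared",
        "ELF headers", "private huge pages", "shared huge pages",
        "private DAX pages", "shared DAX pages",
    };
    printf("coredump_filter = 0x%lx\n", mask);
    for (int i = 0; i < 9; i++)
        printf("bit %d (%s): %s\n", i, bits[i],
               (mask >> i) & 1 ? "dumped" : "skipped");
    return 0;
}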

How to calculate Virtual Memory Size in Mavericks

I would like to know if there is a command/API call (or set of commands/API calls) that calculates each of the parameters (Virtual Memory, File Cache and App Memory) listed in the screenshot above.
You can use the vm_stat and sysctl terminal commands. There was no straightforward way or documentation for extracting the new attributes from these commands, so we had to do some trial and error until we discovered the relationships between the parameters in the commands' output and the attributes we needed to calculate.
The Steps are as the following:
Run vm_stat
Run "sysctl hw.memsize" and "sysctl vm.swapusage".
The relationship between the memory usage that appears in Activity Monitor and the previous commands is described in How to calc Memory usage in Mavericks programmatically.
Sample output from vm_stat:
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free: 24428.
Pages active: 1039653.
Pages inactive: 626002.
Pages speculative: 184530.
Pages throttled: 0.
Pages wired down: 156244.
Pages purgeable: 9429.
"Translation faults": 14335334.
Pages copy-on-write: 557301.
Pages zero filled: 5682527.
Pages reactivated: 74.
Pages purged: 52633.
File-backed pages: 660167.
Anonymous pages: 1190018.
Pages stored in compressor: 644.
Pages occupied by compressor: 603.
Decompressions: 18.
Compressions: 859.
Pageins: 253589.
Pageouts: 0.
Swapins: 0.
Swapouts: 0.
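For the API side, here is a minimal sketch of what I believe are the relevant calls (my own illustration for OS X/macOS): sysctlbyname("hw.memsize") for the total RAM, and the Mach host_statistics64() call for the same counters vm_stat prints. How these counters combine into Activity Monitor's Virtual Memory, File Cache and App Memory figures is the trial-and-error part described above.

#include <stdio.h>
#include <stdint.h>
#include <mach/mach.h>
#include <sys/sysctl.h>

int main(void)
{
    /* Equivalent of "sysctl hw.memsize": total physical memory in bytes. */
    uint64_t memsize = 0;
    size_t len = sizeof(memsize);
    if (sysctlbyname("hw.memsize", &memsize, &len, NULL, 0) != 0) {
        perror("sysctlbyname");
        return 1;
    }

    /* Page size and the counters behind vm_stat. */
    vm_size_t page_size = 0;
    host_page_size(mach_host_self(), &page_size);

    vm_statistics64_data_t vm;
    mach_msg_type_number_t count = HOST_VM_INFO64_COUNT;
    if (host_statistics64(mach_host_self(), HOST_VM_INFO64,
                          (host_info64_t)&vm, &count) != KERN_SUCCESS) {
        fprintf(stderr, "host_statistics64 failed\n");
        return 1;
    }

    printf("hw.memsize:        %llu bytes\n", (unsigned long long)memsize);
    printf("page size:         %lu bytes\n", (unsigned long)page_size);
    printf("pages free:        %u\n", vm.free_count);
    printf("pages active:      %u\n", vm.active_count);
    printf("pages inactive:    %u\n", vm.inactive_count);
    printf("pages wired down:  %u\n", vm.wire_count);
    printf("file-backed pages: %u\n", vm.external_page_count);
    printf("anonymous pages:   %u\n", vm.internal_page_count);
    return 0;
}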
