How does the MMU/Kernel knows that a PTE is valid? - memory-management

Assuming a two-level page table, let's say the program gets allocated 2 pages (8KB)
GPD PD1 PD2 PD3
+-----+ +------+ +------+ +------+
+ 1 + + 22 + +------+ +------+
+ 0 + + 62 + +------+ +------+
+ 0 + + 0 + +------+ +------+
+-----+ +------+ +------+ +------+
How does the MMU/kernel knows when a program attempts to access the third entry of PD1 or like the 2,3 entry of GPD?
Does it initialize the unused PTEs with some value in order to distinguish the unused ones? (like with 0 or something)
I have read in here and there is no valid bit or something in the pte flags

There is a function named pte_none. It can test whether a PTE is empty or not. If a PTE is invalid, its value will be zero and the function pte_none will return TRUE.
If a page table entry is zero, the present bit is also zero, so the mapping from virtual address to physical address does not exist. If a program wants to access this invalid virtual address, MMU will check the present bit and raise an exception called page fault and then CPU will execute an exception handler which is registered by Interrupt Descriptor Table(IDT). What the handler will do depends on the type of this exception. The result may be loading the desired page or terminating the program.

Related

Windows 10 x64: Unable to get PXE on Windbg

Can't understand how Windows Memory Manager works.
I look at the attached user process (dbgview.exe).
It is WOW64-process. At the specified address (0x76560000) there is .text section of the kernel32.dll module (also WOW64).
Why there is no PTE and other tables in the process page table pointing to those virtual address?
kd> db 76560000
00000000`76560000 8b ff 55 8b ec 51 56 57-33 f6 89 55 fc 56 68 80 ..U..QVW3..U.Vh.
<...>
kd> !pte 76560000
VA 0000000076560000
PXE at FFFFF6FB7DBED000 PPE at FFFFF6FB7DA00008 PDE at FFFFF6FB40001D90 PTE at FFFFF680003B2B00
Unable to get PXE FFFFF6FB7DBED000
kd> db FFFFF680003B2B00
fffff680`003b2b00 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ???????????????
<...>
I know that pages will be allocated after first access (with page fault) have occured, but why there is no protype PTE too?
Firstly, translate an arbitrary virtual address to physical using !vtop to see the dirbase of the process in the process of translation, or use !process to find the dirbase of the process:
lkd> .process /p fffffa8046a2e5f0
Implicit process is now fffffa80`46a2e5f0
lkd> .context 77fa90000
lkd> !vtop 0 13fe60000
Amd64VtoP: Virt 00000001`3fe60000, pagedir 7`7fa90000
Amd64VtoP: PML4E 7`7fa90000
Amd64VtoP: PDPE 1`c2e83020
Amd64VtoP: PDE 7`84e04ff8
Amd64VtoP: PTE 4`be585300
Amd64VtoP: Mapped phys 6`3efae000
Virtual address 13fe60000 translates to physical address 63efae000.
Then find that physical frame in the PFN database (in this case the physical page for PML4 (cr3 page aka. dirbase) is 77fa90 with full physical address 77fa90000:
lkd> !pfn 77fa90
PFN 0077FA90 at address FFFFFA80167EFB00
flink FFFFFA8046A2E5F0 blink / share count 00000005 pteaddress FFFFF6FB7DBEDF68
reference count 0001 used entry count 0000 Cached color 0 Priority 0
restore pte 00000080 containing page 77FA90 Active M
Modified
The address FFFFF6FB7DBED000 is therefore the virtual address of the PML4 page and FFFFF6FB7DBEDF68 is the virtual address of the PML4E self reference entry (1ed*8 = f68).
FFFFF6FB7DBED000 = 1111111111111111111101101111101101111101101111101101000000000000
1111111111111111 111101101 111101101 111101101 111101101 000000000000
The PML4 can only be at a virtual address where the PML4E, PDTPE, PDE and PTE index are the same, so there are actually 2^9 different combinations of that and windows 7 always selects 0x1ed i.e. 111101101. The reason for this is because the PML4 contains a PML4 that points to itself i.e. the physical frame of the PML4, so it will need to keep indexing to that same location at every level of the hierarchy.
The PML4, being a page table page, must reside in the kernel, and kernel addresses are high-canonical, i.e. prefixed with 1111111111111111, and kernel addresses begin with 00001 through 11111 i.e. from 08 to ff
The range of possible addresses that a 64 bit OS that uses 8TiB for user address space can place it at is therefore 31*(2^4) = 496 different possible locations and not actually 2^9:
1111111111111111 000010000 000010000 000010000 000010000 000000000000
1111111111111111 111111111 111111111 111111111 111111111 000000000000
I.e. the first is FFFF080402010000, the second is FFFF088442211000, the last is FFFFFFFFFFFFF000.
Note:
Up until Windows 10 TH2, the magic index for the Self-Reference PML4 entry was 0x1ed as mentioned above. But what about Windows 10 from 1607? Well Microsoft uped their game, as a constant battle for improving Windows security the index is randomized at boot-time, so 0x1ed is now one of the 512 [sic. (496)] possible values (i.e. 9-bit index) that the Self-Reference entry index can have. And side effect, it also broke some of their own tools, like the !pte2va WinDbg command.
0xFFFFF68000000000 is the address of the first PTE in the first page table page, so basically MmPteBase, except because on Windows 10 1607 the PML4E can be an other than 0x1ed, the base is not always 0xFFFFF68000000000 as a result, and it uses a variable nt!MmPteBase to know instantly where the base of the page table page allocations begins. Previously, this symbol does not exist in ntoskrnl.exe, because it has a hardcoded base 0xFFFFF68000000000. The address of the first and last page table page is going to be:
first last
* pml4e_offset : 0x1ed 0x1ed
* pdpe_offset : 0x000 0x1ff
* pde_offset : 0x000 0x1ff
* pte_offset : 0x000 0x1ff
* offset : 0x000 0x000
This gives 0xFFFFF68000000000 for the first and 0xFFFFF6FFFFFFF000 for the last page table page when the PML4E index is 0x1ed. PDEs + PDPTEs + PML4Es + PTEs are assigned in this range.
Therefore, to be able to translate a virtual address to its PTE virtual address (and !pte2va is the reverse of this), you affix 111101101 to the start of the virtual address and then you truncate the last 12 bits (the page offset, which is no longer useful) and then you times it by 8 bytes (the PTE size) (i.e. add 3 zeroes to the end, which creates a new page offset from the last level index into the page that contains the PTEs times the size of a PTE structure). Concatenating the PML4E index to the start simply causes it to loop back one time such that you actually get the PTE rather than what the PTE points to. Concatenating it to the start is the same thing as adding it to MmPteBase.
Here is simple C++ code to do it:
// pte.cpp
#include<iostream>
#include<string>
int main(int argc, char *argv[]) {
unsigned long long int input = std::stoull(argv[1], nullptr, 16);
long long int ptebase = 0xFFFFF68000000000;
long long int pteaddress = ptebase + ((input >> 12) << 3);
std::cout << "0x" << std::hex << pteaddress;
}
C:\> pte 13fe60000
0xfffff680009ff300
To get the PDE virtual address you have to affix it twice and then truncate the last 21 bits and then times by 8. This is how !pte is supposed to work, and is the opposite of !pte2va.
Similarly, PDEs + PDPTEs + PML4Es are assigned in the range:
first last
* pml4e_offset : 0x1ed 0x1ed
* pdpe_offset : 0x1ed 0x1ed
* pde_offset : 0x000 0x1ff
* pte_offset : 0x000 0x1ff
* offset : 0x000 0x000
Because when you get to 0x1ed for the pdpte offset within the page table page range, all of a sudden, you are looping back in the PML4 once again, so you get the PDE.
If it says there is no PTE for an address within a virtual page for which the corresponding physical frame is shown to be part of the working set by VMMap, then you might be experiencing my issue, where you need to use .process /P if you're doing live kernel debugging (local or remote) to explicitly tell the debugger that you want to translate user and kernel addresses in the context of the process and not the debugger.
I have found that since Windows 10 Anniversary Update (1607, 10.0.14393) PML4 table had been randomized to mitigate kernel heap spraying.
It means that probably Page Table is not placed at 0xFFFFF6800000.

How to display managed objects with certain value in one of the fields in WinDbg using SOS (or SOSEX)?

My problem is this:
0:000> !DumpHeap -type Microsoft.Internal.ReadLock -stat
------------------------------
Heap 0
total 0 objects
------------------------------
Heap 1
total 0 objects
------------------------------
Heap 2
total 0 objects
------------------------------
Heap 3
total 0 objects
------------------------------
total 0 objects
Statistics:
MT Count TotalSize Class Name
000007fef3d14088 74247 2375904 Microsoft.Internal.ReadLock
Total 74247 objects
The way I read this output is that I have 74,247 Microsoft.Internal.ReadLock instances on my heap. However, some of them are probably pending collection.
I want to display only those which are not pending collection.
For example, 0000000080f88e90 is the address of one of these objects and it is garbage. I know it, because:
0:000> !mroot 0000000080f88e90
No root paths were found.
0:000> !refs 0000000080f88e90 -target
Objects referencing 0000000080f88e90 (Microsoft.Internal.ReadLock):
NONE
0:000> !do 0000000080f88e90
Name: Microsoft.Internal.ReadLock
MethodTable: 000007fef3d14088
EEClass: 000007fef3c63410
Size: 32(0x20) bytes
File: C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.ComponentModel.Composition\v4.0_4.0.0.0__b77a5c561934e089\System.ComponentModel.Composition.dll
Fields:
MT Field Offset Type VT Attr Value Name
000007fef3d13fb0 400001e 8 ...oft.Internal.Lock 0 instance 0000000080001010 _lock
000007fef0a8c7d8 400001f 10 System.Int32 1 instance 1 _isDisposed
As one can see, both sosex.mroot and sosex.refs indicate no one references it, plus dumping its fields reveals that it was disposed through IDisposable, so it makes sense that the object is garbage (I know that being disposed does not imply the object is garbage, but it is in this case).
Now I want to display all those instances which are not garbage. I guess I am to use the .foreach command. Something like this:
.foreach(entry {!dumpheap -type Microsoft.Internal.ReadLock -short}){.if (???) {.printf "%p\n", entry} }
My problem is that I have no idea what goes into the .if condition.
I am able to inspect the _isDisposed field like this:
0:000> dd 0000000080f88e90+10 L1
00000000`80f88ea0 00000001
But .if expects an expression and all I have is a command output. If I knew how to extract information from the command output and arrange it as an expression then I could use it as the .if condition and be good.
So, my question is this - is there a way to get the field value as an expression suitable for .if? Alternatively, is it possible to parse the command output in a way suitable for using the result as the .if condition?
I didn't have an example which uses ReadLock objects, but I tried with Strings and this is my result:
.foreach (entry {!dumpheap -short -type Microsoft.Internal.ReadLock})
{
.if (poi(${entry}+10) == 1)
{
.printf "%p\n", ${entry}
}
}
I'm using poi() to get pointer size data from the address. Also note I'm using ${entry} not entry in both, poi() and .printf. You might also like !do ${entry} inside the .if.
In one line for copy/paste:
.foreach (entry {!dumpheap -short -type Microsoft.Internal.ReadLock}) {.if (poi(${entry}+10) == 1) {.printf "%p\n", ${entry}}}

Stata: foreach creates too many variables -

I created a toy example of my code below.
In this toy example I would like to create a measure of all higher prices minus lower prices within a self-created reference group. So within each reference group, I would like to take each individual and subtract its price value from all higher price values from other individuals in the same group. I do not want to have negative differences. Then I would like to sum all these differences. In creating this code I found some help here:
http://www.stata.com/support/faqs/data-management/try-all-values-with-foreach/
However, the code didn't work perfectly for me, because my dataset is quite large (several 100K obs) and the examples on the website and my code only work until the numlist maximum of 1600 in Stata. (I am using version 12). The toy example with the auto dataset works, due to small size of the dataset.
I would like to ask if someone has an idea how to code this more efficiently, so that I can get around the numlist restriction. I thought about summing the differences directly without saving them in intermediate variables, but that also blow up the numlist restriction.
clear all
sysuse auto
ren headroom refgroup
bysort refgroup : egen pricerank = rank(price)
qui: su pricerank, meanonly
gen test = `r(max)'
su test
foreach i of num 1/`r(max)' {
qui: bys refgroup: gen intermediate`i' = price[_n+`i'] -price if price[_n+`i'] > price
}
egen price_diff = rowmax(intermediate*)
drop intermediate*
If I understand this correctly, this isn't even a problem that requires explicit loops. The sum of all higher prices is just the difference between two cumulative sums. You might need to think through what you want to do if prices are tied.
. clear
. set obs 10
obs was 0, now 10
. gen group = _n > 5
. set seed 2803
. gen price = ceil(1000 * runiform())
. bysort group (price) : gen sumhigherprices = sum(price)
. by group : replace sumhigherprices = sumhigherprices[_N] - sumhigherprices
(10 real changes made)
. list
+--------------------------+
| group price sumhig~s |
|--------------------------|
1. | 0 218 1448 |
2. | 0 264 1184 |
3. | 0 301 883 |
4. | 0 335 548 |
5. | 0 548 0 |
|--------------------------|
6. | 1 125 3027 |
7. | 1 213 2814 |
8. | 1 828 1986 |
9. | 1 988 998 |
10. | 1 998 0 |
+--------------------------+
Edit: For what the OP needs, there is an extra line
. by group : replace sumhigherprices = sumhigherprices - (_N - _n) * price
If I understand the wording of the problem correctly, maybe this can help. It uses joinby (new observations are created and depending on the size of the original database, you may or not hit the Stata hard-limit on number of observations). The code reproduces the results that would follow from the code of the original post. This is a second attempt. The code before this final edit did not provide the sought-after results. The wording of the problem was somewhat difficult for me to understand.
clear all
set more off
* Load data
sysuse auto
* Delete unnecessary vars
ren headroom refgroup
keep refgroup price
* Generate id´s based on rankings (sort)
bysort refgroup (price): gen id = _n
* Pretty list
order refgroup id
sort refgroup id price
list, sepby(refgroup)
* joinby procedure
tempfile main
save "`main'"
rename (price id) =0
joinby refgroup using "`main'"
list, sepby(refgroup)
* Do not compare with itself and drop duplicates
drop if id0 >= id
* Compute differences and max
gen dif = abs(price0 - price)
collapse (max) dif, by(refgroup id0)
list, sepby(refgroup)

What is the contents of the cache after loop?

A computer uses a small direct-mapped cache between the main memory and the
processor. The cache has four 16-bit words, and each word has an associated 13-bit tag,
as shown in Figure (a). When a miss occurs during a read operation, the requested
word is read from the main memory and sent to the processor. At the same time, it is
copied into the cache, and its block number is stored in the associated tag. Consider the
following loop in a program where all instructions and operands are 16 bits long:
LOOP: Add (R1)+,R0
Decrement R2
BNE LOOP
<-13 bits-> <--16bit->
0|TAG |DATA |
2| | |
4| | |
6|_______ | ______ |
(a)Cache
.
.
| A03C |<---ADDRESS 054E
| 05D9 |
| 10D7 |
.
.
(b)Main Memory
Assume that, before this loop is entered, registers R0, R1, and R2 contain 0, 054E,
and 3, respectively. Also assume that the main memory contains the data shown in
Figure (b), where all entries are given in hexadecimal notation. The loop starts at
location LOOP = 02EC.
(a) Show the contents of the cache at the end of each pass through the loop.

tracing a linux kernel, function-by function (biggest only) with us timer

I want to know, how does the linux kernel do some stuff (receiving a tcp packet). In what order main tcp functions are called. I want to see both interrupt handler (top half), bottom half and even work done by kernel after user calls "read()".
How can I get a function trace from kernel with some linear time scale?
I want to get a trace from single packet, not the profile of kernel when receiving 1000th of packets.
Kernel is 2.6.18 or 2.6.23 (supported in my debian). I can add some patches to it.
I think that the closest tool which atleast partially can achieve what you want is kernel ftrace. Here is a sample usage:
root#ansis-xeon:/sys/kernel/debug/tracing# cat available_tracers
blk function_graph mmiotrace function sched_switch nop
root#ansis-xeon:/sys/kernel/debug/tracing# echo 1 > ./tracing_on
root#ansis-xeon:/sys/kernel/debug/tracing# echo function_graph > ./current_trace
root#ansis-xeon:/sys/kernel/debug/tracing# cat trace
3) 0.379 us | __dequeue_entity();
3) 1.552 us | }
3) 0.300 us | hrtick_start_fair();
3) 2.803 us | }
3) 0.304 us | perf_event_task_sched_out();
3) 0.287 us | __phys_addr();
3) 0.382 us | native_load_sp0();
3) 0.290 us | native_load_tls();
------------------------------------------
3) <idle>-0 => ubuntuo-2079
------------------------------------------
3) 0.509 us | __math_state_restore();
3) | finish_task_switch() {
3) 0.337 us | perf_event_task_sched_in();
3) 0.971 us | }
3) ! 100015.0 us | }
3) | hrtimer_cancel() {
3) | hrtimer_try_to_cancel() {
3) | lock_hrtimer_base() {
3) 0.327 us | _spin_lock_irqsave();
3) 0.897 us | }
3) 0.305 us | _spin_unlock_irqrestore();
3) 2.185 us | }
3) 2.749 us | }
3) ! 100022.5 us | }
3) ! 100023.2 us | }
3) 0.704 us | fget_light();
3) 0.522 us | pipe_poll();
3) 0.342 us | fput();
3) 0.476 us | fget_light();
3) 0.467 us | pipe_poll();
3) 0.292 us | fput();
3) 0.394 us | fget_light();
3) | inotify_poll() {
3) | mutex_lock() {
3) 0.285 us | _cond_resched();
3) 1.134 us | }
3) 0.289 us | fsnotify_notify_queue_is_empty();
3) | mutex_unlock() {
3) 2.987 us | }
3) 0.292 us | fput();
3) 0.517 us | fget_light();
3) 0.415 us | pipe_poll();
3) 0.292 us | fput();
3) 0.504 us | fget_light();
3) | sock_poll() {
3) 0.480 us | unix_poll();
3) 4.224 us | }
3) 0.183 us | fput();
3) 0.341 us | fget_light();
3) | sock_poll() {
3) 0.274 us | unix_poll();
3) 0.731 us | }
3) 0.182 us | fput();
3) 0.269 us | fget_light();
It is not perfect because it does not print function parameters and misses some static functions, but you can get the fell of who is calling who inside the kernel.
If this is not enough then use GDB. But as you might already know setting up GDB for kernel debugging is not as easy as it is for user space processes. I prefer to use GDB+qemu if ever needed.
Happy tracing!
Update: On later Linux distributions I suggest to use trace-cmd command line tool that is "wrapper" around /sys/kernel/debug/tracing. trace-cmd is so much more intuitive to use than the raw interface kernel provides.
You want oprofile. It can give you timings for (a selected subset of) your entire system, which means you can trace network activity from the device to the application and back again, through the kernel and all the libraries.
I can't immediately see a way to trace only a single packet at a time. In particular, it may be difficult to get such fine grained tracing since the upper-half of an interrupt handler shouldn't be doing anything blocking (this is dead-lock prone).
Maybe this is overly pedantic, but have you taken a look at the source code? I know from experience the TCP layer is very well commented, both in recording intent and referencing the RFCs.
I can highly recommend TCP/IP Illustrated, esp Volume 2, the Implementation (website). Its very good, and walks through the code line by line. From the description, "Combining 500 illustrations with 15,000 lines of real, working code...". The source code included is from the BSD kernel, but the stacks are extremely similar, and comparing the two is often instructive.
This question is old and probably not relevant to the original poster, but a nice trick I used lately which was helpful to me was to set the "mark" field of the sk_buf to some value and only "printk" if the value matches.
So for example, if you know where the top half IRQ handler (as the question suggests), then you can hard code some check (e.g. tcp port, source IP, source MAC address, well you get the point) and set the mark to some arbitrary value (e.g. skb->mark = 0x9999).
Then all the way up you only printk if the mark has the same value. As long as nobody else changes your mark (which as far as I could see is usually the case in typical settings) then you'll only see the packets you're interested in.
Since most interesting functions get the skb, it works for almost everything that could be interesting.

Resources