I can't understand how the Windows Memory Manager works.
I am looking at an attached user process (dbgview.exe).
It is a WOW64 process. At the specified address (0x76560000) there is the .text section of the kernel32.dll module (also WOW64).
Why is there no PTE, or any of the other paging-structure entries, in the process page tables for that virtual address?
kd> db 76560000
00000000`76560000 8b ff 55 8b ec 51 56 57-33 f6 89 55 fc 56 68 80 ..U..QVW3..U.Vh.
<...>
kd> !pte 76560000
VA 0000000076560000
PXE at FFFFF6FB7DBED000 PPE at FFFFF6FB7DA00008 PDE at FFFFF6FB40001D90 PTE at FFFFF680003B2B00
Unable to get PXE FFFFF6FB7DBED000
kd> db FFFFF680003B2B00
fffff680`003b2b00 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ???????????????
<...>
I know that pages are only allocated after the first access (via a page fault) has occurred, but why is there no prototype PTE either?
First, translate an arbitrary virtual address to physical using !vtop to see the dirbase of the process during the translation, or use !process to find the dirbase of the process:
lkd> .process /p fffffa8046a2e5f0
Implicit process is now fffffa80`46a2e5f0
lkd> .context 77fa90000
lkd> !vtop 0 13fe60000
Amd64VtoP: Virt 00000001`3fe60000, pagedir 7`7fa90000
Amd64VtoP: PML4E 7`7fa90000
Amd64VtoP: PDPE 1`c2e83020
Amd64VtoP: PDE 7`84e04ff8
Amd64VtoP: PTE 4`be585300
Amd64VtoP: Mapped phys 6`3efae000
Virtual address 13fe60000 translates to physical address 63efae000.
Then find that physical frame in the PFN database. In this case the physical page of the PML4 (the cr3 page, a.k.a. dirbase) is 77fa90, with full physical address 77fa90000:
lkd> !pfn 77fa90
PFN 0077FA90 at address FFFFFA80167EFB00
flink FFFFFA8046A2E5F0 blink / share count 00000005 pteaddress FFFFF6FB7DBEDF68
reference count 0001 used entry count 0000 Cached color 0 Priority 0
restore pte 00000080 containing page 77FA90 Active M
Modified
The address FFFFF6FB7DBED000 is therefore the virtual address of the PML4 page and FFFFF6FB7DBEDF68 is the virtual address of the PML4E self reference entry (1ed*8 = f68).
FFFFF6FB7DBED000 = 1111111111111111111101101111101101111101101111101101000000000000
1111111111111111 111101101 111101101 111101101 111101101 000000000000
The PML4 can only be at a virtual address where the PML4E, PDPTE, PDE and PTE indexes are all the same, so there are nominally 2^9 different combinations, and Windows 7 always selects 0x1ed, i.e. 111101101. The reason is that the PML4 contains a PML4E that points to itself, i.e. to the physical frame of the PML4, so the translation has to keep indexing to that same entry at every level of the hierarchy.
The PML4, being a page table page, must reside in kernel space, and kernel addresses are high-canonical, i.e. prefixed with 1111111111111111, and kernel addresses begin with 00001 through 11111, i.e. from 08 to ff.
The number of addresses at which a 64-bit OS that uses 8 TiB for user address space can place it is therefore 31*(2^4) = 496, and not actually 2^9:
1111111111111111 000010000 000010000 000010000 000010000 000000000000
1111111111111111 111111111 111111111 111111111 111111111 000000000000
I.e. the first is FFFF080402010000, the second is FFFF088442211000, the last is FFFFFFFFFFFFF000.
Note:
Up until Windows 10 TH2, the magic index for the Self-Reference PML4 entry was 0x1ed as mentioned above. But what about Windows 10 from 1607? Well, Microsoft upped their game: in the constant battle to improve Windows security, the index is randomized at boot time, so 0x1ed is now one of the 512 [sic. (496)] possible values (i.e. 9-bit index) that the Self-Reference entry index can have. As a side effect, it also broke some of their own tools, like the !pte2va WinDbg command.
0xFFFFF68000000000 is the address of the first PTE in the first page table page, so it is basically MmPteBase. However, because from Windows 10 1607 the self-reference PML4E can be at an index other than 0x1ed, the base is not always 0xFFFFF68000000000, and the kernel uses the variable nt!MmPteBase to know immediately where the page table page range begins. Previously this symbol did not exist in ntoskrnl.exe, because the base was hardcoded as 0xFFFFF68000000000. The address of the first and last page table page is going to be:
                   first    last
* pml4e_offset :   0x1ed    0x1ed
* pdpe_offset  :   0x000    0x1ff
* pde_offset   :   0x000    0x1ff
* pte_offset   :   0x000    0x1ff
* offset       :   0x000    0x000
This gives 0xFFFFF68000000000 for the first and 0xFFFFF6FFFFFFF000 for the last page table page when the PML4E index is 0x1ed. PDEs + PDPTEs + PML4Es + PTEs are assigned in this range.
Therefore, to translate a virtual address to its PTE virtual address (!pte2va is the reverse of this), you prefix 111101101 to the virtual address, truncate the last 12 bits (the page offset, which is no longer useful), and then multiply by 8 bytes (the PTE size), i.e. append 3 zero bits, which creates a new page offset from the last-level index into the page that contains the PTEs times the size of a PTE structure. Prefixing the PML4E index simply causes the translation to loop back one time, so that you actually get the PTE rather than what the PTE points to. Prefixing it is the same thing as adding the result to MmPteBase.
Here is simple C++ code to do it:
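// pte.cpp
#include <iostream>
#include <string>
int main(int argc, char *argv[]) {
    if (argc < 2) {
        std::cerr << "usage: pte <hex virtual address>\n";
        return 1;
    }
    // Keep only the 48 translated bits so sign-extended kernel addresses also work.
    unsigned long long input = std::stoull(argv[1], nullptr, 16) & 0xFFFFFFFFFFFFULL;
    // Hardcoded PTE base for self-reference index 0x1ed (before Windows 10 1607).
    const unsigned long long ptebase = 0xFFFFF68000000000ULL;
    // Drop the 12-bit page offset, then scale by the 8-byte PTE size.
    unsigned long long pteaddress = ptebase + ((input >> 12) << 3);
    std::cout << "0x" << std::hex << pteaddress << '\n';
}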
C:\> pte 13fe60000
0xfffff680009ff300
To get the PDE virtual address you have to prefix the index twice, truncate the last 21 bits, and then multiply by 8. This is how !pte is supposed to work, and it is the opposite of !pte2va.
Similarly, PDEs + PDPTEs + PML4Es are assigned in the range:
                   first    last
* pml4e_offset :   0x1ed    0x1ed
* pdpe_offset  :   0x1ed    0x1ed
* pde_offset   :   0x000    0x1ff
* pte_offset   :   0x000    0x1ff
* offset       :   0x000    0x000
Because when you get to 0x1ed for the pdpte offset within the page table page range, all of a sudden, you are looping back in the PML4 once again, so you get the PDE.
If it says there is no PTE for an address within a virtual page for which the corresponding physical frame is shown to be part of the working set by VMMap, then you might be experiencing my issue, where you need to use .process /P if you're doing live kernel debugging (local or remote) to explicitly tell the debugger that you want to translate user and kernel addresses in the context of the process and not the debugger.
I have found that since the Windows 10 Anniversary Update (1607, 10.0.14393) the location of the PML4 table has been randomized to mitigate kernel heap spraying.
This means that the page tables are probably not placed at 0xFFFFF68000000000.
We have an NDIS LWF driver, and on a single machine we get a DPC_WATCHDOG_VIOLATION 133/1 bugcheck when the user tries to connect to their VPN to reach the internet. This could be related to our NdisFIndicateReceiveNetBufferLists call, as we raise the IRQL to DISPATCH_LEVEL before calling it (and lower it back to its previous value afterward), and that call does appear in the output of !dpcwatchdog shown below. We do this as a workaround for another bug explained here:
IRQL_UNEXPECTED_VALUE BSOD after NdisFIndicateReceiveNetBufferLists?
Now this is the bugcheck:
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
DISPATCH_LEVEL or above. The offending component can usually be
identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: fffff805422fb320, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
additional information regarding the cumulative timeout
Arg4: 0000000000000000
STACK_TEXT:
nt!KeBugCheckEx
nt!KeAccumulateTicks+0x1846b2
nt!KiUpdateRunTime+0x5d
nt!KiUpdateTime+0x4a1
nt!KeClockInterruptNotify+0x2e3
nt!HalpTimerClockInterrupt+0xe2
nt!KiCallInterruptServiceRoutine+0xa5
nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
nt!KiInterruptDispatchNoLockNoEtw+0x37
nt!KxWaitForSpinLockAndAcquire+0x2c
nt!KeAcquireSpinLockAtDpcLevel+0x5c
wanarp!WanNdisReceivePackets+0x4bb
ndis!ndisMIndicateNetBufferListsToOpen+0x141
ndis!ndisMTopReceiveNetBufferLists+0x3f0e4
ndis!ndisCallReceiveHandler+0x61
ndis!ndisInvokeNextReceiveHandler+0x1df
ndis!NdisMIndicateReceiveNetBufferLists+0x104
ndiswan!IndicateRecvPacket+0x596
ndiswan!ApplyQoSAndIndicateRecvPacket+0x20b
ndiswan!ProcessPPPFrame+0x16f
ndiswan!ReceivePPP+0xb3
ndiswan!ProtoCoReceiveNetBufferListChain+0x442
ndis!ndisMCoIndicateReceiveNetBufferListsToNetBufferLists+0xf6
ndis!NdisMCoIndicateReceiveNetBufferLists+0x11
raspptp!CallIndicateReceived+0x210
raspptp!CallProcessRxNBLs+0x199
ndis!ndisDispatchIoWorkItem+0x12
nt!IopProcessWorkItem+0x135
nt!ExpWorkerThread+0x105
nt!PspSystemThreadStartup+0x55
nt!KiStartSystemThread+0x28
SYMBOL_NAME: wanarp!WanNdisReceivePackets+4bb
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: wanarp
IMAGE_NAME: wanarp.sys
The following is the output of !dpcwatchdog, but I still can't find what is causing this bugcheck, or which function is spending too much time at DISPATCH_LEVEL. I think it could be related to some spin locking done by wanarp. Could this be a bug in wanarp? Note that we don't use any spin locks in our driver, and raising the IRQL ourselves should not cause any issue, as it is very common for NDIS indications to be made at DISPATCH_LEVEL.
So how can I find the root cause of this bugcheck? There are no other third-party LWFs in the NDIS stack.
3: kd> !dpcwatchdog
All durations are in seconds (1 System tick = 15.625000 milliseconds)
Circular Kernel Context Logger history: !logdump 0x2
DPC and ISR stats: !intstats /d
--------------------------------------------------
CPU#0
--------------------------------------------------
Current DPC: No Active DPC
Pending DPCs:
----------------------------------------
CPU Type KDPC Function
dpcs: no pending DPCs found
--------------------------------------------------
CPU#1
--------------------------------------------------
Current DPC: No Active DPC
Pending DPCs:
----------------------------------------
CPU Type KDPC Function
1: Normal : 0xfffff80542220e00 0xfffff805418dbf10 nt!PpmCheckPeriodicStart
1: Normal : 0xfffff80542231d40 0xfffff8054192c730 nt!KiBalanceSetManagerDeferredRoutine
1: Normal : 0xffffbd0146590868 0xfffff80541953200 nt!KiEntropyDpcRoutine
DPC Watchdog Captures Analysis for CPU #1.
DPC Watchdog capture size: 641 stacks.
Number of unique stacks: 1.
No common functions detected!
The captured stacks seem to indicate that only a single DPC or generic function is the culprit.
Try to analyse what other processors were doing at the time of the following reference capture:
CPU #1 DPC Watchdog Reference Stack (#0 of 641) - Time: 16 Min 17 Sec 984.38 mSec
# RetAddr Call Site
00 fffff805418d8991 nt!KiUpdateRunTime+0x5D
01 fffff805418d2803 nt!KiUpdateTime+0x4A1
02 fffff805418db1c2 nt!KeClockInterruptNotify+0x2E3
03 fffff80541808a45 nt!HalpTimerClockInterrupt+0xE2
04 fffff805419fab9a nt!KiCallInterruptServiceRoutine+0xA5
05 fffff805419fb107 nt!KiInterruptSubDispatchNoLockNoEtw+0xFA
06 fffff805418a9a9c nt!KiInterruptDispatchNoLockNoEtw+0x37
07 fffff805418da3cc nt!KxWaitForSpinLockAndAcquire+0x2C
08 fffff8054fa614cb nt!KeAcquireSpinLockAtDpcLevel+0x5C
09 fffff80546ba1eb1 wanarp!WanNdisReceivePackets+0x4BB
0a fffff80546be0b84 ndis!ndisMIndicateNetBufferListsToOpen+0x141
0b fffff80546ba7ef1 ndis!ndisMTopReceiveNetBufferLists+0x3F0E4
0c fffff80546bddfef ndis!ndisCallReceiveHandler+0x61
0d fffff80546ba4a94 ndis!ndisInvokeNextReceiveHandler+0x1DF
0e fffff8057c32d17e ndis!NdisMIndicateReceiveNetBufferLists+0x104
0f fffff8057c30d6c7 ndiswan!IndicateRecvPacket+0x596
10 fffff8057c32d56b ndiswan!ApplyQoSAndIndicateRecvPacket+0x20B
11 fffff8057c32d823 ndiswan!ProcessPPPFrame+0x16F
12 fffff8057c308e62 ndiswan!ReceivePPP+0xB3
13 fffff80546c5c006 ndiswan!ProtoCoReceiveNetBufferListChain+0x442
14 fffff80546c5c2d1 ndis!ndisMCoIndicateReceiveNetBufferListsToNetBufferLists+0xF6
15 fffff8057c2b0064 ndis!NdisMCoIndicateReceiveNetBufferLists+0x11
16 fffff8057c2b06a9 raspptp!CallIndicateReceived+0x210
17 fffff80546bd9dc2 raspptp!CallProcessRxNBLs+0x199
18 fffff80541899645 ndis!ndisDispatchIoWorkItem+0x12
19 fffff80541852b65 nt!IopProcessWorkItem+0x135
1a fffff80541871d25 nt!ExpWorkerThread+0x105
1b fffff80541a00778 nt!PspSystemThreadStartup+0x55
1c ---------------- nt!KiStartSystemThread+0x28
--------------------------------------------------
CPU#2
--------------------------------------------------
Current DPC: No Active DPC
Pending DPCs:
----------------------------------------
CPU Type KDPC Function
2: Normal : 0xffffbd01467f0868 0xfffff80541953200 nt!KiEntropyDpcRoutine
DPC Watchdog Captures Analysis for CPU #2.
DPC Watchdog capture size: 641 stacks.
Number of unique stacks: 1.
No common functions detected!
The captured stacks seem to indicate that only a single DPC or generic function is the culprit.
Try to analyse what other processors were doing at the time of the following reference capture:
CPU #2 DPC Watchdog Reference Stack (#0 of 641) - Time: 16 Min 17 Sec 984.38 mSec
# RetAddr Call Site
00 fffff805418d245a nt!KeClockInterruptNotify+0x453
01 fffff80541808a45 nt!HalpTimerClockIpiRoutine+0x1A
02 fffff805419fab9a nt!KiCallInterruptServiceRoutine+0xA5
03 fffff805419fb107 nt!KiInterruptSubDispatchNoLockNoEtw+0xFA
04 fffff805418a9a9c nt!KiInterruptDispatchNoLockNoEtw+0x37
05 fffff805418a9a68 nt!KxWaitForSpinLockAndAcquire+0x2C
06 fffff8054fa611cb nt!KeAcquireSpinLockRaiseToDpc+0x88
07 fffff80546ba1eb1 wanarp!WanNdisReceivePackets+0x1BB
08 fffff80546be0b84 ndis!ndisMIndicateNetBufferListsToOpen+0x141
09 fffff80546ba7ef1 ndis!ndisMTopReceiveNetBufferLists+0x3F0E4
0a fffff80546bddfef ndis!ndisCallReceiveHandler+0x61
0b fffff80546be3a81 ndis!ndisInvokeNextReceiveHandler+0x1DF
0c fffff80546ba804e ndis!ndisFilterIndicateReceiveNetBufferLists+0x3C611
0d fffff8054e384d77 ndis!NdisFIndicateReceiveNetBufferLists+0x6E
0e fffff8054e3811a9 ourdriver+0x4D70
0f fffff80546ba7d40 ourdriver+0x11A0
10 fffff8054182a6b5 ndis!ndisDummyIrpHandler+0x100
11 fffff80541c164c8 nt!IofCallDriver+0x55
12 fffff80541c162c7 nt!IopSynchronousServiceTail+0x1A8
13 fffff80541c15646 nt!IopXxxControlFile+0xC67
14 fffff80541a0aab5 nt!NtDeviceIoControlFile+0x56
15 ---------------- nt!KiSystemServiceCopyEnd+0x25
--------------------------------------------------
CPU#3
--------------------------------------------------
Current DPC: No Active DPC
Pending DPCs:
----------------------------------------
CPU Type KDPC Function
dpcs: no pending DPCs found
Target machine version: Windows 10 Kernel Version 19041 MP (4 procs)
Also note that we pass the NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL flag to NdisFIndicateReceiveNetBufferLists if the current IRQL is DISPATCH_LEVEL.
Edit1:
Here is also the output of !locks, !qlocks and !ready. The contention count on one of the resources is 49135; is this normal or too high? Could this be related to our issue? The threads that own it or are waiting on it belong to normal processes such as chrome, csrss, etc.
3: kd> !kdexts.locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks.
Resource # nt!ExpTimeRefreshLock (0xfffff80542219440) Exclusively owned
Contention Count = 17
Threads: ffffcf8ce9dee640-01<*>
KD: Scanning for held locks.....
Resource # 0xffffcf8cde7f59f8 Shared 1 owning threads
Contention Count = 62
Threads: ffffcf8ce84ec080-01<*>
KD: Scanning for held locks...............................................................................................
Resource # 0xffffcf8ce08d0890 Exclusively owned
Contention Count = 49135
NumberOfSharedWaiters = 1
NumberOfExclusiveWaiters = 6
Threads: ffffcf8cf18e3080-01<*> ffffcf8ce3faf080-01
Threads Waiting On Exclusive Access:
ffffcf8ceb6ce080 ffffcf8ce1d20080 ffffcf8ce77f1080 ffffcf8ce92f4080
ffffcf8ce1d1f0c0 ffffcf8ced7c6080
KD: Scanning for held locks.
Resource # 0xffffcf8ce08d0990 Shared 1 owning threads
Threads: ffffcf8cf18e3080-01<*>
KD: Scanning for held locks.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Resource # 0xffffcf8ceff46350 Shared 1 owning threads
Threads: ffffcf8ce6de8080-01<*>
KD: Scanning for held locks......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Resource # 0xffffcf8cf0cade50 Exclusively owned
Contention Count = 3
Threads: ffffcf8ce84ec080-01<*>
KD: Scanning for held locks.........................
Resource # 0xffffcf8cf0f76180 Shared 1 owning threads
Threads: ffffcf8ce83dc080-02<*>
KD: Scanning for held locks.......................................................................................................................................................................................................................................................
Resource # 0xffffcf8cf1875cb0 Shared 1 owning threads
Contention Count = 3
Threads: ffffcf8ce89db040-02<*>
KD: Scanning for held locks.
Resource # 0xffffcf8cf18742d0 Shared 1 owning threads
Threads: ffffcf8cee5e1080-02<*>
KD: Scanning for held locks....................................................................................
Resource # 0xffffcf8cdceeece0 Shared 2 owning threads
Contention Count = 4
Threads: ffffcf8ce3a1c080-01<*> ffffcf8ce5625040-01<*>
Resource # 0xffffcf8cdceeed48 Shared 1 owning threads
Threads: ffffcf8ce5625043-02<*> *** Actual Thread ffffcf8ce5625040
KD: Scanning for held locks...
Resource # 0xffffcf8cf1d377d0 Exclusively owned
Threads: ffffcf8cf0ff3080-02<*>
KD: Scanning for held locks....
Resource # 0xffffcf8cf1807050 Exclusively owned
Threads: ffffcf8ce84ec080-01<*>
KD: Scanning for held locks......
245594 total locks, 13 locks currently held
3: kd> !qlocks
Key: O = Owner, 1-n = Wait order, blank = not owned/waiting, C = Corrupt
Processor Number
Lock Name 0 1 2 3
KE - Unused Spare
MM - Unused Spare
MM - Unused Spare
MM - Unused Spare
CC - Vacb
CC - Master
EX - NonPagedPool
IO - Cancel
CC - Unused Spare
IO - Vpb
IO - Database
IO - Completion
NTFS - Struct
AFD - WorkQueue
CC - Bcb
MM - NonPagedPool
3: kd> !ready
KSHARED_READY_QUEUE fffff8053f1ada00: (00) ****------------------------------------------------------------
SharedReadyQueue fffff8053f1ada00: No threads in READY state
Processor 0: No threads in READY state
Processor 1: Ready Threads at priority 15
THREAD ffffcf8ce9dee640 Cid 2054.2100 Teb: 000000fab7bca000 Win32Thread: 0000000000000000 READY on processor 1
Processor 2: No threads in READY state
Processor 3: No threads in READY state
3: kd> dt nt!_ERESOURCE 0xffffcf8ce08d0890
+0x000 SystemResourcesList : _LIST_ENTRY [ 0xffffcf8c`e08d0610 - 0xffffcf8c`e08cf710 ]
+0x010 OwnerTable : 0xffffcf8c`ee6e8210 _OWNER_ENTRY
+0x018 ActiveCount : 0n1
+0x01a Flag : 0xf86
+0x01a ReservedLowFlags : 0x86 ''
+0x01b WaiterPriority : 0xf ''
+0x020 SharedWaiters : 0xffffae09`adcae8e0 Void
+0x028 ExclusiveWaiters : 0xffffae09`a9aabea0 Void
+0x030 OwnerEntry : _OWNER_ENTRY
+0x040 ActiveEntries : 1
+0x044 ContentionCount : 0xbfef
+0x048 NumberOfSharedWaiters : 1
+0x04c NumberOfExclusiveWaiters : 6
+0x050 Reserved2 : (null)
+0x058 Address : (null)
+0x058 CreatorBackTraceIndex : 0
+0x060 SpinLock : 0
3: kd> dx -id 0,0,ffffcf8cdcc92040 -r1 (*((ntkrnlmp!_OWNER_ENTRY *)0xffffcf8ce08d08c0))
(*((ntkrnlmp!_OWNER_ENTRY *)0xffffcf8ce08d08c0)) [Type: _OWNER_ENTRY]
[+0x000] OwnerThread : 0xffffcf8cf18e3080 [Type: unsigned __int64]
[+0x008 ( 0: 0)] IoPriorityBoosted : 0x0 [Type: unsigned long]
[+0x008 ( 1: 1)] OwnerReferenced : 0x0 [Type: unsigned long]
[+0x008 ( 2: 2)] IoQoSPriorityBoosted : 0x1 [Type: unsigned long]
[+0x008 (31: 3)] OwnerCount : 0x1 [Type: unsigned long]
[+0x008] TableSize : 0xc [Type: unsigned long]
3: kd> dx -id 0,0,ffffcf8cdcc92040 -r1 ((ntkrnlmp!_OWNER_ENTRY *)0xffffcf8cee6e8210)
((ntkrnlmp!_OWNER_ENTRY *)0xffffcf8cee6e8210) : 0xffffcf8cee6e8210 [Type: _OWNER_ENTRY *]
[+0x000] OwnerThread : 0x0 [Type: unsigned __int64]
[+0x008 ( 0: 0)] IoPriorityBoosted : 0x1 [Type: unsigned long]
[+0x008 ( 1: 1)] OwnerReferenced : 0x1 [Type: unsigned long]
[+0x008 ( 2: 2)] IoQoSPriorityBoosted : 0x1 [Type: unsigned long]
[+0x008 (31: 3)] OwnerCount : 0x0 [Type: unsigned long]
[+0x008] TableSize : 0x7 [Type: unsigned long]
Thanks for reporting this. I've tracked this down to an OS bug: there's a deadlock in wanarp. This issue appears to affect every version of the OS going back to Windows Vista.
I've filed internal issue task.ms/42393356 to track this: if you have a Microsoft support contract, your rep can get you status updates on that issue.
Meanwhile, you can partially work around this issue by either:
Indicating 1 packet at a time (NumberOfNetBufferLists==1); or
Indicating on a single CPU at a time
The bug in wanarp is exposed when 2 or more CPUs collectively process 3 or more NBLs at the same time. So either workaround would avoid the trigger conditions.
Depending on how much bandwidth you're pushing through this network interface, those options could be rather bad for CPU/battery/throughput. So please try to avoid pessimizing batching unless it's really necessary. (For example, you could make this an option that's off-by-default, unless the customer specifically uses wanarp.)
Note that you cannot fully prevent the issue yourself. Other drivers in the stack, including NDIS itself, have the right to group packets together, which would have the side effect of re-batching the packets that you carefully un-batched. However, I believe that you can make a statistically significant dent in the crashes if you just indicate 1 NBL at a time, or indicate multiple NBLs on 1 CPU at a time.
Sorry this is happening to you again! wanarp is... a very old codebase.
I want to see which function in the win32k.sys driver handles a specific syscall number.
I attach WinDbg to a GUI process, since win32k.sys is a session space driver.
Then I shift the first DWORD value right by 4 bits, add the base address of W32pServiceTable, and use the u command to show the function in WinDbg, but the address isn't valid. I checked KiSystemCall64 and it seems to be doing the same thing.
!process 0 0 winlogon.exe
.process /p (PROCESS addr)
.reload
Answer: the DWORD value from the table is loaded with this instruction:
movsxd r11,dword ptr [r10+rax*4]
The W32pServiceTable DWORD values have bit 31 set, so movsxd sign-extends them and sets the upper 32 bits of r11 to all ones; adding r11 to the table base address then leads to the correct function.
These values are negative so you need to preserve that when you shift off the bits. For example:
0: kd> dd win32k!W32pServiceTable L1
fffff88b`d1568000 ff8c8340
0: kd> u win32k!W32pServiceTable + ffffffff`fff8c834 L1
win32k!NtUserGetThreadState:
fffff88b`d14f4834 4883ec28 sub rsp,28h
Also, WinDbg is very picky/weird/broken/unpredictable when it comes to sign extension so you need to be careful about how you do this. For example, this doesn't work:
0: kd> u win32k!W32pServiceTable + fff8c834 L1
fffff88c`d14f4834 ?? ???
Due to WinDbg zero extending the value. But this does:
0: kd> u win32k!W32pServiceTable + (fff8c834) L1
win32k!NtUserGetThreadState:
fffff88b`d14f4834 4883ec28 sub rsp,28h
Because the () causes WinDbg to sign extend instead of zero extend.
Lastly, this happens even on the normal service table, it's not just a Win32k thing.
Here is addiu instruction opcode (16-bit instructions, GCC option -mmicromips):
full instruction: addiu sp,sp,-280
opcode, hexa: 4F75
opcode, binary: 1001(instruction) 11101(sp is $29) 110101
My goal is to detect all instructions of this kind (addiu sp,sp,<imm>)
and then decode the immediate, in the above case -280, in order to follow sp.
What I don't understand is the encoding of (-280).
Linked to: How to get a call stack backtrace?(GCC,MIPS,no frame pointer)
microMIPS has a specialized ADDIUSP instruction which the assembler chose to use. The first 6 bits are the opcode 010011, the next 9 bits are the encoded immediate 110111010 = 0x1BA and the LSB is reserved at 1.
The encoding for the immediate uses scaling by 4 and sign extension. Given that 0x1BA = -70 (using 9 bits) the value is -70 * 4 = -280.
I want to calculate the end offset of a parent locator in a VHD. Here is a part of the VHD header:
Cookie: cxsparse
Data offset: 0xffffffffffffffff
Table offset: 0x2000
Header version: 0x00010000
Max table entries: 10240
Block size: 0x200000
Checksum: 4294956454
Parent Unique Id: 0x9678bf077e719640b55e40826ce5d178
Parent time stamp: 525527478
Reserved: 0
Parent Unicode name:
Parent locator 1:
- platform code: 0x57326b75
- platform_data_space: 4096
- platform_data_length: 86
- reserved: 0
- platform_data_offset: 0x1000
Parent locator 2:
- platform code: 0x57327275
- platform_data_space: 65536
- platform_data_length: 34
- reserved: 0
- platform_data_offset: 0xc000
Some definitions from the Virtual Hard Disk Image Format Specification:
"Table Offset: This field stores the absolute byte offset of the Block Allocation Table (BAT) in the file.
Platform Data Space: This field stores the number of 512-byte sectors needed to store the parent hard disk locator.
Platform Data Offset: This field stores the absolute file offset in bytes where the platform specific file locator data is stored.
Platform Data Length. This field stores the actual length of the parent hard disk locator in bytes."
Based on this the end offset of the two parent locators should be:
data offset + 512 * data space:
0x1000 + 512 * 4096 = 0x201000
0xc000 + 512 * 65536 = 0x200c000
But if one uses only data offset + data space:
0x1000 + 4096 = 0x2000 //end of parent locator 1, begin of BAT
0xc000 + 65536 = 0x1c000
This latter calculation makes much more sense: the end of the first parent locator is the beginning of the BAT (see header data above); and since the first BAT entry is 0xe7 (sector offset), this corresponds to file offset 0x1ce00 (sector offset * 512), which is OK, if the second parent locator ends at 0x1c000.
But if one uses the formula data offset + 512 * data space, one ends up with other data written inside the parent locator. (In this example there would be no data corruption, though, since Platform Data Length is very small.)
So is this a mistake in the specification, and the sentence
"Platform Data Space: This field stores the number of 512-byte sectors needed to store the parent hard disk locator."
should be
"Platform Data Space: This field stores the number of bytes needed to store the parent hard disk locator."?
Apparently Microsoft has not bothered to correct this mistake in the specification, even though it had already been discovered by the VirtualBox developers. VHD.cpp contains the following comment:
/*
* The VHD spec states that the DataSpace field holds the number of sectors
* required to store the parent locator path.
* As it turned out VPC and Hyper-V store the amount of bytes reserved for the
* path and not the number of sectors.
*/
So, I'm trying to run some simple code, jdk-8, output via jol
System.out.println(VMSupport.vmDetails());
Integer i = new Integer(23);
System.out.println(ClassLayout.parseInstance(i)
.toPrintable());
The first attempt is to run it on a 64-bit JVM with both compressed oops and compressed klass pointers disabled.
-XX:-UseCompressedOops -XX:-UseCompressedClassPointers
The output, pretty much expected is :
Running 64-bit HotSpot VM.
Objects are 8 bytes aligned.
java.lang.Integer object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
4 4 (object header) 00 00 00 00 (00000000 00000000 00000000 00000000) (0)
8 4 (object header) 48 33 36 97 (01001000 00110011 00110110 10010111) (-1758055608)
12 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
16 4 int Integer.value 23
20 4 (loss due to the next object alignment)
Instance size: 24 bytes (reported by Instrumentation API)
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
That makes sense : 8 bytes klass word + 8 bytes mark word + 4 bytes for the actual value and 4 for padding (to align on 8 bytes) = 24 bytes.
The second attempt is to run it on a 64-bit JVM with both compressed oops and compressed klass pointers enabled.
Again, the output is pretty much understandable :
Running 64-bit HotSpot VM.
Using compressed oop with 3-bit shift.
Using compressed klass with 3-bit shift.
Objects are 8 bytes aligned.
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
4 4 (object header) 00 00 00 00 (00000000 00000000 00000000 00000000) (0)
8 4 (object header) f9 33 01 f8 (11111001 00110011 00000001 11111000) (-134138887)
12 4 int Dummy.i 42
Instance size: 16 bytes (reported by Instrumentation API).
4 bytes compressed oop (klass word) + 8 bytes mark word + 4 bytes for the value + no space loss = 16 bytes.
The thing that does NOT make sense to me is this use-case:
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers -XX:ObjectAlignmentInBytes=16
The output is this:
Running 64-bit HotSpot VM.
Using compressed oop with 4-bit shift.
Using compressed klass with 0x0000001000000000 base address and 0-bit shift.
I was really expecting both to be "4-bit shift". Why are they not?
EDIT
The second example is run with :
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers
And the third one with :
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers -XX:ObjectAlignmentInBytes=16
Answers to these questions are mostly easy to figure out when looking into OpenJDK code.
For example, grep for "UseCompressedClassPointers", this will get you to arguments.cpp:
// Check the CompressedClassSpaceSize to make sure we use compressed klass ptrs.
if (UseCompressedClassPointers) {
if (CompressedClassSpaceSize > KlassEncodingMetaspaceMax) {
warning("CompressedClassSpaceSize is too large for UseCompressedClassPointers");
FLAG_SET_DEFAULT(UseCompressedClassPointers, false);
}
}
Okay, interesting, there is "CompressedClassSpaceSize"? Grep for its definition, it's in globals.hpp:
product(size_t, CompressedClassSpaceSize, 1*G, \
"Maximum size of class area in Metaspace when compressed " \
"class pointers are used") \
range(1*M, 3*G) \
Aha, so the class area is in Metaspace, and it takes somewhere between 1 MB and 3 GB of space. Let's grep for "CompressedClassSpaceSize" usages, because that will take us to the actual code that handles it, say in metaspace.cpp:
// For UseCompressedClassPointers the class space is reserved above
// the top of the Java heap. The argument passed in is at the base of
// the compressed space.
void Metaspace::initialize_class_space(ReservedSpace rs) {
So, classes referenced through compressed class pointers are allocated in a smaller class space outside the Java heap, which does not require shifting: even 3 gigabytes is small enough to be covered by a 32-bit value.
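To see why no shift is needed, here is a small sketch of how far a 32-bit narrow pointer can reach for a given shift. The class and method names are hypothetical; the only assumption is that decoding left-shifts the 32-bit value by the given number of bits.

```java
public class NarrowPointerReach {
    // Bytes addressable by a 32-bit narrow pointer that is left-shifted by `shift` bits on decode.
    static long reachBytes(int shift) {
        return (1L << 32) << shift;
    }

    public static void main(String[] args) {
        long GB = 1L << 30;
        System.out.println(reachBytes(0) / GB);      // 4  (no shift)
        System.out.println(reachBytes(3) / GB);      // 32 (8-byte alignment)
        System.out.println(reachBytes(4) / GB);      // 64 (16-byte alignment)
        System.out.println(3 * GB <= reachBytes(0)); // true: a 3 GB class space fits unshifted
    }
}
```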
I will try to extend the answer provided by Alexey a little, as some things might not be obvious.
Following Alexey's suggestion, if we search the OpenJDK source code for the place where the compressed klass shift value is assigned, we find the following code in metaspace.cpp:
void Metaspace::set_narrow_klass_base_and_shift(address metaspace_base, address cds_base) {
// some code removed
if ((uint64_t)(higher_address - lower_base) <= UnscaledClassSpaceMax) {
Universe::set_narrow_klass_shift(0);
} else {
assert(!UseSharedSpaces, "Cannot shift with UseSharedSpaces");
Universe::set_narrow_klass_shift(LogKlassAlignmentInBytes);
}
As we can see, the class shift can be either 0 (that is, no shifting) or 3 bits, because LogKlassAlignmentInBytes is a constant defined in globalDefinitions.hpp:
const int LogKlassAlignmentInBytes = 3;
So, the answer to your question:
I was really expecting to both be "4-bit shift". Why they are not?
is that ObjectAlignmentInBytes has no effect on the alignment used for compressed class pointers in the metaspace, which is always 8 bytes.
Of course this conclusion does not answer the question:
"Why when using -XX:ObjectAlignmentInBytes=16 with -XX:+UseCompressedClassPointers the narrow klass shift becomes zero? Also, without shifting how can the JVM reference the class space with 32-bit references, if the heap is 4GBytes or more?"
We already know that the class space is allocated above the Java heap and can be up to 3 GB in size. With that in mind, let's run a few tests. -XX:+UseCompressedOops and -XX:+UseCompressedClassPointers are enabled by default, so we can omit them for conciseness.
Test 1: Defaults - 8 Bytes aligned
$ java -XX:ObjectAlignmentInBytes=8 -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x00000006c0000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000000000000000, Narrow klass shift: 3
Compressed class space size: 1073741824 Address: 0x00000007c0000000 Req Addr: 0x00000007c0000000
Notice that the heap starts at address 0x00000006c0000000 in the virtual address space and has a size of 4 GB. If we jump 4 GB from where the heap starts, we land exactly where the class space begins.
0x00000006c0000000 + 0x0000000100000000 = 0x00000007c0000000
The class space size is 1 GB, so let's jump another 1 GB:
0x00000007c0000000 + 0x0000000040000000 = 0x0000000800000000
and we land exactly at 32 GB, so the class space ends right at the 32 GB boundary. With a 3-bit shift the JVM is able to reference the entire class space, although it is at the limit (intentionally).
Test 2: 16 bytes aligned
$ java -XX:ObjectAlignmentInBytes=16 -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x0000000f00000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000001000000000, Narrow klass shift: 0
Compressed class space size: 1073741824 Address: 0x0000001000000000 Req Addr: 0x0000001000000000
This time we can observe that the heap address is different, but let's try the same steps:
0x0000000f00000000 + 0x0000000100000000 = 0x0000001000000000
This time the heap ends right at the 64 GB boundary of the virtual address space, and the class space is allocated from that boundary upward. Since the class space can use at most a 3-bit shift, which reaches only 32 GB from a zero base, how can the JVM reference a class space located at 64 GB? The key is:
Narrow klass base: 0x0000001000000000
The JVM still uses 32-bit compressed pointers for the class space, but when decoding them it adds the base 0x0000001000000000 to the compressed reference (and subtracts it when encoding) instead of shifting. Note that this approach works as long as the referenced chunk of memory is no larger than 4 GB (the limit for 32-bit references). Since the class space can be at most 3 GB, we are comfortably within the limit.
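The two encoding modes can be sketched as follows. This is an illustration only, not HotSpot code: the class and method names are hypothetical, the klass word 0xf80133f9 is the little-endian reading of the bytes f9 33 01 f8 from the JOL dump in the question, and decoding it with the Test 1 settings assumes the question was run with the same layout as Test 1. The Test 2 address is made up by adding the same offset to the Test 2 base.

```java
public class NarrowKlassSketch {
    // Decode: widen the 32-bit value as unsigned, apply the shift, add the base.
    static long decode(int narrow, long base, int shift) {
        return base + (Integer.toUnsignedLong(narrow) << shift);
    }

    // Encode: subtract the base, undo the shift, keep the low 32 bits.
    static int encode(long address, long base, int shift) {
        return (int) ((address - base) >>> shift);
    }

    public static void main(String[] args) {
        // Test 1 mode: zero base, 3-bit shift.
        int klassWord = 0xf80133f9; // bytes f9 33 01 f8 from the JOL dump
        System.out.println(Long.toHexString(decode(klassWord, 0L, 3))); // 7c0099fc8

        // Test 2 mode: non-zero base, no shift.
        long base = 0x0000001000000000L;
        long address = base + 0x99fc8L; // hypothetical Klass address inside the class space
        int narrow = encode(address, base, 0);
        System.out.println(Integer.toHexString(narrow));                // 99fc8
        System.out.println(decode(narrow, base, 0) == address);         // true
    }
}
```

Under that assumption, the decoded address 0x7c0099fc8 falls inside the class space reported in Test 1, which starts at 0x00000007c0000000.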
Test 3: 16 bytes aligned, pin heap base at 8g
$ java -XX:ObjectAlignmentInBytes=16 -XX:HeapBaseMinAddress=8g -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x0000000200000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000000000000000, Narrow klass shift: 3
Compressed class space size: 1073741824 Address: 0x0000000300000000 Req Addr: 0x0000000300000000
In this test we keep -XX:ObjectAlignmentInBytes=16, but also ask the JVM to place the heap at 8 GB in the virtual address space using the -XX:HeapBaseMinAddress=8g argument. The class space then begins at 12 GB in the virtual address space, and a 3-bit shift is more than enough to reference it.
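The three results can be reproduced with a simplified model of the base and shift selection. Treat this as a sketch under assumptions: the excerpt above shows only the final shift decision, so the zero-base rule (use base 0 when the class space ends at or below 32 GB) and the 4 GB value of UnscaledClassSpaceMax are my recollection of JDK 8-era metaspace.cpp, not something shown in this thread. CDS is ignored, and the Java names are hypothetical.

```java
public class KlassShiftModel {
    static final long G = 1L << 30;
    static final long UNSCALED_CLASS_SPACE_MAX = 4 * G; // assumption: 2^32 bytes
    static final int LOG_KLASS_ALIGNMENT_IN_BYTES = 3;  // from globalDefinitions.hpp

    // Returns {narrow klass base, narrow klass shift} for a class space at the given address.
    static long[] baseAndShift(long classSpaceStart, long classSpaceSize) {
        long higherAddress = classSpaceStart + classSpaceSize;
        long lowerBase = classSpaceStart;
        // Assumed rule: a class space ending at or below 32 GB needs no base.
        if (higherAddress <= (UNSCALED_CLASS_SPACE_MAX << LOG_KLASS_ALIGNMENT_IN_BYTES)) {
            lowerBase = 0;
        }
        // The decision shown in the excerpt: shift only if the span exceeds 4 GB.
        int shift = (higherAddress - lowerBase <= UNSCALED_CLASS_SPACE_MAX)
                ? 0 : LOG_KLASS_ALIGNMENT_IN_BYTES;
        return new long[] { lowerBase, shift };
    }

    static void report(String name, long classSpaceStart) {
        long[] r = baseAndShift(classSpaceStart, 1 * G); // 1 GB class space, as in all three tests
        System.out.printf("%s: base=0x%x shift=%d%n", name, r[0], r[1]);
    }

    public static void main(String[] args) {
        report("Test 1", 0x00000007c0000000L); // base=0x0 shift=3
        report("Test 2", 0x0000001000000000L); // base=0x1000000000 shift=0
        report("Test 3", 0x0000000300000000L); // base=0x0 shift=3
    }
}
```

In this model the shift is never derived from ObjectAlignmentInBytes. The larger object alignment only changes where the heap, and therefore the class space, ends up in the address space.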
Hopefully, these tests and their results answer the question:
"Why when using -XX:ObjectAlignmentInBytes=16 with -XX:+UseCompressedClassPointers the narrow klass shift becomes zero? Also, without shifting how can the JVM reference the class space with 32-bit references, if the heap is 4GBytes or more?"