ARM linux kernel head-common.S

ARM linux kernel head-common.S - linux-kernel

I was looking head-common.S
at the __mmap_switched:
.long init_thread_union + THREAD_START_SP # sp //for stack pointer
THREAD_START_SP is defined THREAD_SIZE(8192) - 8 in "thread+info.h"
set stack size 8KB(8129) and minus 8byte.
why minus 8byte?
i suspect, i think DA(decrement after) right?

The 8 bytes aligned is the requirement in APCS.
In APCS, the chapter 5.2.1 The Stack,
The stack must also conform to the following constraint at a public interface:
SP mod 8 = 0. The stack must be double-word aligned.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.abi/index.html

Related

Dumping W32pServiceTable

I want to see what function in win32k.sys driver handles specific syscall number.
I attach windbg to GUI process since win32k.sys is season space driver.
Then I shift first DWORD value right by 4 bits add base address of W32pServiceTable and use u command to show function in WinDbg but address isn't valid. I checked KiSystemCall64 and it seems to be doing the same thing.
!process 0 0 winlogon.exe
.process /p (PROCESS addr)
.reload
Answer: DWORD value from table is loaded with this instruction
movsxd r11,dword ptr [r10+rax*4]
W32pServiceTable DWORD values has bit at 31 position set to 1 so movsxd sets upper 32 bits of r11 register to 1 then adding r11 and table base address leads to correct function.

These values are negative so you need to preserve that when you shift off the bits. For example:
0: kd> dd win32k!W32pServiceTable L1
fffff88b`d1568000 ff8c8340
0: kd> u win32k!W32pServiceTable + ffffffff`fff8c834 L1
win32k!NtUserGetThreadState:
fffff88b`d14f4834 4883ec28 sub rsp,28h
Also, WinDbg is very picky/weird/broken/unpredictable when it comes to sign extension so you need to be careful about how you do this. For example, this doesn't work:
0: kd> u win32k!W32pServiceTable + fff8c834 L1
fffff88c`d14f4834 ?? ???
Due to WinDbg zero extending the value. But this does:
0: kd> u win32k!W32pServiceTable + (fff8c834) L1
win32k!NtUserGetThreadState:
fffff88b`d14f4834 4883ec28 sub rsp,28h
Because the () causes WinDbg to sign extend instead of zero extend.
Lastly, this happens even on the normal service table, it's not just a Win32k thing.

MPI_ALLreduce with Fortran and 2 bytes integer

I'm trying to do an MPI sum of 2 bytes integer:
INTEGER, PARAMETER :: SIK2 = SELECTED_INT_KIND(2)
INTEGER(SIK2) :: s_save(dim)
Indeed its an array which takes integer values from 1 to 48 max, so 2 bytes is enough for memory reasons.
Therefore I tried the following:
CALL MPI_TYPE_CREATE_F90_INTEGER(SIK2, int2type, ierr)
CALL MPI_ALLreduce(MPI_IN_PLACE, s_save, nkpt_in, int2type, MPI_SUM, world_comm, ierr)
This works well for Gfortran + openmpi.
However in the case of intel I get a crash:
MPI_Allreduce(1000)......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x55d2160, count=987, dtype=USER<f90_integer>, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_SUM_check_dtype(106): MPI_Op MPI_SUM operation not defined for this datatype
Is there a proper (or recommended) way to do this so that it works for most compilers?

Windows 10 x64: Unable to get PXE on Windbg

Can't understand how Windows Memory Manager works.
I look at the attached user process (dbgview.exe).
It is WOW64-process. At the specified address (0x76560000) there is .text section of the kernel32.dll module (also WOW64).
Why there is no PTE and other tables in the process page table pointing to those virtual address?
kd> db 76560000
00000000`76560000 8b ff 55 8b ec 51 56 57-33 f6 89 55 fc 56 68 80 ..U..QVW3..U.Vh.
<...>
kd> !pte 76560000
VA 0000000076560000
PXE at FFFFF6FB7DBED000 PPE at FFFFF6FB7DA00008 PDE at FFFFF6FB40001D90 PTE at FFFFF680003B2B00
Unable to get PXE FFFFF6FB7DBED000
kd> db FFFFF680003B2B00
fffff680`003b2b00 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ???????????????
<...>
I know that pages will be allocated after first access (with page fault) have occured, but why there is no protype PTE too?

Firstly, translate an arbitrary virtual address to physical using !vtop to see the dirbase of the process in the process of translation, or use !process to find the dirbase of the process:
lkd> .process /p fffffa8046a2e5f0
Implicit process is now fffffa80`46a2e5f0
lkd> .context 77fa90000
lkd> !vtop 0 13fe60000
Amd64VtoP: Virt 00000001`3fe60000, pagedir 7`7fa90000
Amd64VtoP: PML4E 7`7fa90000
Amd64VtoP: PDPE 1`c2e83020
Amd64VtoP: PDE 7`84e04ff8
Amd64VtoP: PTE 4`be585300
Amd64VtoP: Mapped phys 6`3efae000
Virtual address 13fe60000 translates to physical address 63efae000.
Then find that physical frame in the PFN database (in this case the physical page for PML4 (cr3 page aka. dirbase) is 77fa90 with full physical address 77fa90000:
lkd> !pfn 77fa90
PFN 0077FA90 at address FFFFFA80167EFB00
flink FFFFFA8046A2E5F0 blink / share count 00000005 pteaddress FFFFF6FB7DBEDF68
reference count 0001 used entry count 0000 Cached color 0 Priority 0
restore pte 00000080 containing page 77FA90 Active M
Modified
The address FFFFF6FB7DBED000 is therefore the virtual address of the PML4 page and FFFFF6FB7DBEDF68 is the virtual address of the PML4E self reference entry (1ed*8 = f68).
FFFFF6FB7DBED000 = 1111111111111111111101101111101101111101101111101101000000000000
1111111111111111 111101101 111101101 111101101 111101101 000000000000
The PML4 can only be at a virtual address where the PML4E, PDTPE, PDE and PTE index are the same, so there are actually 2^9 different combinations of that and windows 7 always selects 0x1ed i.e. 111101101. The reason for this is because the PML4 contains a PML4 that points to itself i.e. the physical frame of the PML4, so it will need to keep indexing to that same location at every level of the hierarchy.
The PML4, being a page table page, must reside in the kernel, and kernel addresses are high-canonical, i.e. prefixed with 1111111111111111, and kernel addresses begin with 00001 through 11111 i.e. from 08 to ff
The range of possible addresses that a 64 bit OS that uses 8TiB for user address space can place it at is therefore 31*(2^4) = 496 different possible locations and not actually 2^9:
1111111111111111 000010000 000010000 000010000 000010000 000000000000
1111111111111111 111111111 111111111 111111111 111111111 000000000000
I.e. the first is FFFF080402010000, the second is FFFF088442211000, the last is FFFFFFFFFFFFF000.
Note:
Up until Windows 10 TH2, the magic index for the Self-Reference PML4 entry was 0x1ed as mentioned above. But what about Windows 10 from 1607? Well Microsoft uped their game, as a constant battle for improving Windows security the index is randomized at boot-time, so 0x1ed is now one of the 512 [sic. (496)] possible values (i.e. 9-bit index) that the Self-Reference entry index can have. And side effect, it also broke some of their own tools, like the !pte2va WinDbg command.
0xFFFFF68000000000 is the address of the first PTE in the first page table page, so basically MmPteBase, except because on Windows 10 1607 the PML4E can be an other than 0x1ed, the base is not always 0xFFFFF68000000000 as a result, and it uses a variable nt!MmPteBase to know instantly where the base of the page table page allocations begins. Previously, this symbol does not exist in ntoskrnl.exe, because it has a hardcoded base 0xFFFFF68000000000. The address of the first and last page table page is going to be:
first last
* pml4e_offset : 0x1ed 0x1ed
* pdpe_offset : 0x000 0x1ff
* pde_offset : 0x000 0x1ff
* pte_offset : 0x000 0x1ff
* offset : 0x000 0x000
This gives 0xFFFFF68000000000 for the first and 0xFFFFF6FFFFFFF000 for the last page table page when the PML4E index is 0x1ed. PDEs + PDPTEs + PML4Es + PTEs are assigned in this range.
Therefore, to be able to translate a virtual address to its PTE virtual address (and !pte2va is the reverse of this), you affix 111101101 to the start of the virtual address and then you truncate the last 12 bits (the page offset, which is no longer useful) and then you times it by 8 bytes (the PTE size) (i.e. add 3 zeroes to the end, which creates a new page offset from the last level index into the page that contains the PTEs times the size of a PTE structure). Concatenating the PML4E index to the start simply causes it to loop back one time such that you actually get the PTE rather than what the PTE points to. Concatenating it to the start is the same thing as adding it to MmPteBase.
Here is simple C++ code to do it:
// pte.cpp
#include<iostream>
#include<string>
int main(int argc, char *argv[]) {
unsigned long long int input = std::stoull(argv[1], nullptr, 16);
long long int ptebase = 0xFFFFF68000000000;
long long int pteaddress = ptebase + ((input >> 12) << 3);
std::cout << "0x" << std::hex << pteaddress;
}
C:\> pte 13fe60000
0xfffff680009ff300
To get the PDE virtual address you have to affix it twice and then truncate the last 21 bits and then times by 8. This is how !pte is supposed to work, and is the opposite of !pte2va.
Similarly, PDEs + PDPTEs + PML4Es are assigned in the range:
first last
* pml4e_offset : 0x1ed 0x1ed
* pdpe_offset : 0x1ed 0x1ed
* pde_offset : 0x000 0x1ff
* pte_offset : 0x000 0x1ff
* offset : 0x000 0x000
Because when you get to 0x1ed for the pdpte offset within the page table page range, all of a sudden, you are looping back in the PML4 once again, so you get the PDE.
If it says there is no PTE for an address within a virtual page for which the corresponding physical frame is shown to be part of the working set by VMMap, then you might be experiencing my issue, where you need to use .process /P if you're doing live kernel debugging (local or remote) to explicitly tell the debugger that you want to translate user and kernel addresses in the context of the process and not the debugger.

I have found that since Windows 10 Anniversary Update (1607, 10.0.14393) PML4 table had been randomized to mitigate kernel heap spraying.
It means that probably Page Table is not placed at 0xFFFFF6800000.

addiu instruction encoding (MIPS,GCC)

Here is addiu instruction opcode (16-bit instructions, GCC option -mmicromips):
full instruction: addiu sp,sp,-280
opcode, hexa: 4F75
opcode, binary: 1001(instruction) 11101(sp is $29) 110101
My purpose is to detect all instruction of this kind (addiu sp,sp,)
and then to decode the immediate, in the above case (-280) (to follow the sp).
What I don't understand is the encoding of (-280).
Linked to: How to get a call stack backtrace?(GCC,MIPS,no frame pointer)

microMips has a specialized ADDIUSP instruction which the assembler chose to use. The first 6 bits are the opcode 010011, the next 9 bits are the encoded immediate 110111010 = 0x1BA and the LSB is reserved at 1.
The encoding for the immediate uses scaling by 4 and sign extension. Given that 0x1BA = -70 (using 9 bits) the value is -70 * 4 = -280.

java8 -XX:+UseCompressedOops -XX:ObjectAlignmentInBytes=16

So, I'm trying to run some simple code, jdk-8, output via jol
System.out.println(VMSupport.vmDetails());
Integer i = new Integer(23);
System.out.println(ClassLayout.parseInstance(i)
.toPrintable());
The first attempt is to run it with compressed oops disabled and compressed klass also on 64-bit JVM.
-XX:-UseCompressedOops -XX:-UseCompressedClassPointers
The output, pretty much expected is :
Running 64-bit HotSpot VM.
Objects are 8 bytes aligned.
java.lang.Integer object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
4 4 (object header) 00 00 00 00 (00000000 00000000 00000000 00000000) (0)
8 4 (object header) 48 33 36 97 (01001000 00110011 00110110 10010111) (-1758055608)
12 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
16 4 int Integer.value 23
20 4 (loss due to the next object alignment)
Instance size: 24 bytes (reported by Instrumentation API)
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
That makes sense : 8 bytes klass word + 8 bytes mark word + 4 bytes for the actual value and 4 for padding (to align on 8 bytes) = 24 bytes.
The second attempt it to run it with compressed oops enabled compressed klass also on 64-bit JVM.
Again, the output is pretty much understandable :
Running 64-bit HotSpot VM.
Using compressed oop with 3-bit shift.
Using compressed klass with 3-bit shift.
Objects are 8 bytes aligned.
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
4 4 (object header) 00 00 00 00 (00000000 00000000 00000000 00000000) (0)
8 4 (object header) f9 33 01 f8 (11111001 00110011 00000001 11111000) (-134138887)
12 4 int Dummy.i 42
Instance size: 16 bytes (reported by Instrumentation API).
4 bytes compressed oop (klass word) + 8 bytes mark word + 4 bytes for the value + no space loss = 16 bytes.
The thing that does NOT make sense to me is this use-case:
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers -XX:ObjectAlignmentInBytes=16
The output is this:
Running 64-bit HotSpot VM.
Using compressed oop with 4-bit shift.
Using compressed klass with 0x0000001000000000 base address and 0-bit shift.
I was really expecting to both be "4-bit shift". Why they are not?
EDIT
The second example is run with :
XX:+UseCompressedOops -XX:+UseCompressedClassPointers
And the third one with :
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers -XX:ObjectAlignmentInBytes=16

Answers to these questions are mostly easy to figure out when looking into OpenJDK code.
For example, grep for "UseCompressedClassPointers", this will get you to arguments.cpp:
// Check the CompressedClassSpaceSize to make sure we use compressed klass ptrs.
if (UseCompressedClassPointers) {
if (CompressedClassSpaceSize > KlassEncodingMetaspaceMax) {
warning("CompressedClassSpaceSize is too large for UseCompressedClassPointers");
FLAG_SET_DEFAULT(UseCompressedClassPointers, false);
}
}
Okay, interesting, there is "CompressedClassSpaceSize"? Grep for its definition, it's in globals.hpp:
product(size_t, CompressedClassSpaceSize, 1*G, \
"Maximum size of class area in Metaspace when compressed " \
"class pointers are used") \
range(1*M, 3*G) \
Aha, so the class area is in Metaspace, and it takes somewhere between 1 Mb and 3 Gb of space. Let's grep for "CompressedClassSpaceSize" usages, because that will take us to actual code that handles it, say in metaspace.cpp:
// For UseCompressedClassPointers the class space is reserved above
// the top of the Java heap. The argument passed in is at the base of
// the compressed space.
void Metaspace::initialize_class_space(ReservedSpace rs) {
So, compressed classes are allocated in a smaller class space outside the Java heap, which does not require shifting -- even 3 gigabytes is small enough to use only the lowest 32 bits.

I will try to extend a little bit on the answer provided by Alexey as some things might not be obvious.
Following Alexey suggestion, if we search the source code of OpenJDK for where compressed klass bit shift value is assigned, we will find the following code in metaspace.cpp:
void Metaspace::set_narrow_klass_base_and_shift(address metaspace_base, address cds_base) {
// some code removed
if ((uint64_t)(higher_address - lower_base) <= UnscaledClassSpaceMax) {
Universe::set_narrow_klass_shift(0);
} else {
assert(!UseSharedSpaces, "Cannot shift with UseSharedSpaces");
Universe::set_narrow_klass_shift(LogKlassAlignmentInBytes);
}
As we can see, the class shift can either be 0(or basically no shifting) or 3 bits, because LogKlassAlignmentInBytes is a constant defined in globalDefinitions.hpp:
const int LogKlassAlignmentInBytes = 3;
So, the answer to your quetion:
I was really expecting to both be "4-bit shift". Why they are not?
is that ObjectAlignmentInBytes does not have any effect on compressed class pointers alignment in the metaspace which is always 8bytes.
Of course this conclusion does not answer the question:
"Why when using -XX:ObjectAlignmentInBytes=16 with -XX:+UseCompressedClassPointers the narrow klass shift becomes zero? Also, without shifting how can the JVM reference the class space with 32-bit references, if the heap is 4GBytes or more?"
We already know that the class space is allocated on top of the java heap and can be up to 3G in size. With that in mind let's make a few tests. -XX:+UseCompressedOops -XX:+UseCompressedClassPointers are enabled by default, so we can eliminate these for conciseness.
Test 1: Defaults - 8 Bytes aligned
$ java -XX:ObjectAlignmentInBytes=8 -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x00000006c0000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000000000000000, Narrow klass shift: 3
Compressed class space size: 1073741824 Address: 0x00000007c0000000 Req Addr: 0x00000007c0000000
Notice that the heap starts at address 0x00000006c0000000 in the virtual space and has a size of 4GBytes. Let's jump by 4Gbytes from where the heap starts and we land just where class space begins.
0x00000006c0000000 + 0x0000000100000000 = 0x00000007c0000000
The class space size is 1Gbyte, so let's jump by another 1Gbyte:
0x00000007c0000000 + 0x0000000040000000 = 0x0000000800000000
and we land just below 32Gbytes. With a 3 bits class space shifting the JVM is able to reference the entire class space, although it's at the limit (intentionally).
Test 2: 16 bytes aligned
java -XX:ObjectAlignmentInBytes=16 -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x0000000f00000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000001000000000, Narrow klass shift: 0
Compressed class space size: 1073741824 Address: 0x0000001000000000 Req Addr: 0x0000001000000000
This time we can observe that the heap address is different, but let's try the same steps:
0x0000000f00000000 + 0x0000000100000000 = 0x0000001000000000
This time around heap space ends just below 64GBytes virtual space boundary and the class space is allocated above 64Gbyte boundary. Since class space can use only 3 bits shifting, how can the JVM reference the class space located above 64Gbyte? The key is:
Narrow klass base: 0x0000001000000000
The JVM still uses 32 bit compressed pointers for the class space, but when encoding and decoding these, it will always add 0x0000001000000000 base to the compressed reference instead of using shifting. Note, that this approach works as long as the referenced chunk of memory is lower than 4Gbytes (the limit for 32 bits references). Considering that the class space can have a maximum of 3Gbytes we are comfortably within the limits.
3: 16 bytes aligned, pin heap base at 8g
$ java -XX:ObjectAlignmentInBytes=16 -XX:HeapBaseMinAddress=8g -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x0000000200000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000000000000000, Narrow klass shift: 3
Compressed class space size: 1073741824 Address: 0x0000000300000000 Req Addr: 0x0000000300000000
In this test we are still keeping the -XX:ObjectAlignmentInBytes=16, but also asking the JVM to allocate the heap at the 8th GByte in the virtual address space using -XX:HeapBaseMinAddress=8g JVM argument. The class space will begin at 12th GByte in the virtual address space and 3 bits shifting is more than enough to reference it.
Hopefully, these tests and their results answer the question:
"Why when using -XX:ObjectAlignmentInBytes=16 with -XX:+UseCompressedClassPointers the narrow klass shift becomes zero? Also, without shifting how can the JVM reference the class space with 32-bit references, if the heap is 4GBytes or more?"

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio