Dump the contents of the TLB of an x86 CPU - memory-management

Is it possible to get the list of translations (from virtual pages to physical pages) from the TLB (Translation Lookaside Buffer, a special cache in the CPU)? I mean on modern x86 or x86_64, and I want to do it programmatically, not by using JTAG and shifting all the TLB entries out.

The Linux kernel has no such dumper. There is a page in the kernel documentation about the cache and TLB: https://www.kernel.org/doc/Documentation/cachetlb.txt "Cache and TLB Flushing Under Linux", David S. Miller.
There was such a TLB dump facility in the 80386DX (and 80486, and possibly in the "Embedded Pentium" 100-166 MHz / "Embedded Pentium MMX" 200-233 MHz of 1998):
1 - Book "MICROPROCESSORS: THE 8086/8088, 80186/80286, 80386/80486 AND THE PENTIUM FAMILY", ISBN 9788120339422, 2010, page 579
This was done via the test registers TR6 and TR7:
2 - Book "Microprocessors & Microcontrollers" by Godse&Godse, 2008 ISBN 9788184312973 page SA3-PA19: "3.2.7.3 Test Registers" "only two test registers (TR6-TR7) are currently defined. ... These registers are used to check translation lookaside buffer (TLB) of the paging unit."
3 - "x86-Programmierung und -Betriebsarten (Teil 5). Die Testregister TR6 und TR7", a German article about these registers: "Two test registers, TR6 and TR7, are provided for testing the translation lookaside buffer. They are referred to as the test command register (TR6) and the test data register (TR7)."
4 - Intel's "Embedded Pentium® Processor Family Developer's Manual", chapter "26 Model Specific Registers and Functions", page 8, "26.2.1.2 TLB Test Registers"
TR6 is the command register; the linear address is written to it, and it can be used either to write an entry into the TLB or to read a line from the TLB. TR7 holds the data that is written to, or read from, the TLB.
Wikipedia says at https://en.wikipedia.org/wiki/Test_register that accesses to TR6/TR7 "generate invalid opcode exception on any CPU newer than 80486."
The mov encoding for tr6/tr7 was usable only at privilege level 0: http://www.fermimn.gov.it/linux/quarta/x86/movrs.htm
0F 24 /r movl tr6/tr7,r32 12 Move (test register) to (register)
movl %tr6,%ebx
movl %tr7,%ebx
0F 26 /r movl r32,tr6/tr7 12 Move (register) to (test register)
movl %ebx,%tr6
movl %ebx,%tr7
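For illustration, here is a minimal C sketch of performing one TLB lookup through the test registers. It is hedged: the TR6/TR7 field layout (command bit in TR6, physical page plus status bits in TR7) is taken from my reading of the i386 programmer's reference and should be checked against the datasheet; it only works at privilege level 0 on a 386/486-class CPU and raises #UD on anything newer. The opcodes are emitted as raw bytes because modern assemblers may no longer accept the %tr6/%tr7 operands.

#include <stdint.h>

/* TR6 command bit: 1 = TLB lookup, 0 = TLB write (per the i386 manual) */
#define TR6_CMD_LOOKUP 0x1u

static inline uint32_t tlb_lookup(uint32_t linear_page)
{
    uint32_t tr6 = (linear_page & 0xFFFFF000u) | TR6_CMD_LOOKUP;
    uint32_t tr7;

    /* movl %eax,%tr6 -- 0F 26 /r with reg=TR6, r/m=EAX (ModRM 0xF0) */
    asm volatile(".byte 0x0f, 0x26, 0xf0" : : "a"(tr6));
    /* movl %tr7,%eax -- 0F 24 /r with reg=TR7, r/m=EAX (ModRM 0xF8) */
    asm volatile(".byte 0x0f, 0x24, 0xf8" : "=a"(tr7));

    /* On a lookup, TR7 carries the physical page number plus status bits
       (e.g. whether the lookup actually hit the TLB). */
    return tr7;
}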

You can get the list of VA-to-PA translations stored in the TLB, but you may have to use a processor emulator like QEMU. You can download and install QEMU from http://wiki.qemu.org/Main_Page
You can boot a kernel stored in a disk image (typically in qcow2 or raw format) and run your application. You may have to tweak the code in QEMU to print the contents of its TLB: look at the tlb_* functions in qemu/exec.c and add a tlb_dump function that prints the TLB contents. As far as I know, this is the closest you can get to dumping the contents of the TLB.
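As a rough illustration (not actual QEMU code), such a dump helper for the softmmu TLB in an older QEMU, where exec.c still holds the tlb_* helpers, might look like the sketch below. The CPUTLBEntry field names (addr_read/addr_write/addr_code/addend) and the CPU_TLB_SIZE/NB_MMU_MODES constants are assumptions based on that older layout and should be checked against the QEMU version you build; also note this dumps QEMU's internal translation cache rather than the guest's hardware TLB.

/* Hedged sketch of a TLB dumper for an older QEMU softmmu; place it next to
   the other tlb_* helpers and call it from a monitor command or tlb_flush(). */
static void tlb_dump(CPUArchState *env)
{
    int mmu_idx, i;

    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
        for (i = 0; i < CPU_TLB_SIZE; i++) {
            CPUTLBEntry *te = &env->tlb_table[mmu_idx][i];

            /* skip slots that have never been filled (all comparators invalid) */
            if (te->addr_read == -1 && te->addr_write == -1 && te->addr_code == -1)
                continue;

            printf("mmu%d[%03d] read=%08lx write=%08lx code=%08lx addend=%lx\n",
                   mmu_idx, i,
                   (long)te->addr_read, (long)te->addr_write,
                   (long)te->addr_code, (long)te->addend);
        }
    }
}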
P.S.: I started answering this question and then realized it was a year old.

Related

Ask for clarification about "the segment registers continue to point to the same linear addresses as in real address mode" [duplicate]

This question already has an answer here: How can the x86 processor fetch the instruction just after GDT is loaded by a bootloader?
The question is about the persistent validity of the code segment selector while switching from real mode to protected mode on the Intel i386. The switching code is as follows (excerpted from bootasm.S of the x86 version of xv6):
9138 # Switch from real to protected mode. Use a bootstrap GDT that makes
9139 # virtual addresses map directly to physical addresses so that the
9140 # effective memory map doesn’t change during the transition.
9141 lgdt gdtdesc
9142 movl %cr0, %eax
9143 orl $CR0_PE, %eax
9144 movl %eax, %cr0
9150 # Complete the transition to 32−bit protected mode by using a long jmp
9151 # to reload %cs and %eip. The segment descriptors are set up with no
9152 # translation, so that the mapping is still the identity mapping.
9153 ljmp $(SEG_KCODE<<3), $start32
The GDT layout is as follows:
9182 gdt:
9183 SEG_NULLASM # null seg
9184 SEG_ASM(STA_X|STA_R, 0x0, 0xffffffff) # code seg
9185 SEG_ASM(STA_W, 0x0, 0xffffffff) # data seg
After executing line 9144, the processor switches to protected mode, in which segment-based memory management is enabled (but paging is not yet enabled). My understanding is that, since segment-based MM is now enabled, fetching the following instruction should conform to the rules of segment-based MM. At this point (immediately before line 9153), however, the code selector is still 0, which in my understanding means the code segment should be described by the zeroth descriptor in the GDT, which is null. So my question arises naturally: how can such a null descriptor be used to fetch the ljmp instruction that follows? I tried to answer my question by googling, and a document gives some explanation: http://www.logix.cz/michal/doc/i386/chp10-03.htm#10-03
The segment registers continue to point to the same linear addresses
as in real address mode
This sentence seems to answer my question: if the segment registers continue to point to the same linear addresses, the next instruction fetched will be the same as in real mode, namely the ljmp. But it immediately raises a series of new questions: why can the segment selector "continue to point to the same linear addresses"? Hasn't the processor been switched to protected mode? Doesn't the value 0 in %cs point to the zeroth descriptor, rather than the first one (set up at line 9184), which is the descriptor that should be used to fetch the ljmp instruction? How does the x86 CPU magically know that ljmp is the next instruction it should execute? Where in any manual is this magic described? I tried to persuade myself that the ljmp had been prefetched into the processor's instruction queue, but the second paragraph of the same webpage tells me that the prefetched ljmp, if any, has been invalidated, so the CPU should fetch the next instruction afresh. Can you please clarify how "the segment registers continue to point to the same linear addresses as in real address mode" works? Thank you.
P.S.: the CPU I am working with is Intel i386 compatible.
The modern reference is the Intel Software Developer's Manual, Volume 3A, Section 9.9.1, "Switching to protected mode".
Intel isn't big on explaining how magic works internally. What it says, and all you need to know, is that if your movl %eax, %cr0 is immediately followed by a far jump or far call, then everything will work. If you put any other instruction there, then "random failures can occur" (their wording).
As it says, %cs continues to hold its previous value, and presumably that's the value that would be pushed on the stack if you did a far call as the instruction after movl %eax, %cr0. (Where the stack would be is another interesting question - I think everyone uses the jump instead so it rarely comes up.) But for this one instruction it evidently isn't used as a selector in the usual way.
One guess as to how it might work: we know that in protected mode, there are hidden registers that store the segment attributes, and are reloaded from the descriptor table when you load a segment register. So the movl %eax, %cr0 might cause the hidden register corresponding to %cs to be loaded with attributes of a segment whose base address is the linear address of the current 16-bit segment: e.g. if %cs contained 0x1234 then it could be a segment with base address 0x12340. But the %cs register itself could be left alone, temporarily not matching its hidden counterpart. Then if the high bits of %eip are zeroed, the next instruction would be fetched from the right place. That instruction is required to be the long jump which will reload %cs as well as the hidden segment attribute register.
It's also possible that it just sets some internal flag that says "even though in protected mode, fetch the next instruction according to real-mode address translation". Then this flag gets cleared when a far jump occurs, or after one instruction has been fetched, or something like that.
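To make that guess concrete, here is a small illustrative C model of the visible selector plus its hidden descriptor cache. This is not real hardware code, just a way to picture the architecture; the struct and function names are made up.

#include <stdint.h>

/* Each segment register has a visible selector and a hidden, cached
   base/limit/attributes that the CPU actually uses for every access. */
struct seg_reg {
    uint16_t selector;  /* visible part: what "mov %cs, ..." would show */
    uint32_t base;      /* hidden part: added to the offset on every fetch */
    uint32_t limit;
    uint32_t attrs;
};

/* In real mode, loading a segment register just sets base = selector << 4. */
static void real_mode_load(struct seg_reg *sr, uint16_t sel)
{
    sr->selector = sel;
    sr->base     = (uint32_t)sel << 4;
}

/* Setting CR0.PE does not touch the hidden part, so the very next fetch still
   uses the stale real-mode base -- which is why the far jump immediately after
   it works, and why it must come immediately: the far jump reloads both the
   visible selector and the hidden cache from the GDT descriptor. */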

What does write_cr0(read_cr0() | 0x10000) do?

I searched the web a lot but didn't find a short explanation of what write_cr0(read_cr0() | 0x10000) really does. It is related to the Linux kernel, and I am curious about developing LKMs. I want to know what this line really does and what the security issues with it are.
It is used to remove the write protection on the syscall table.
But how does it really work, and what does each part of this line do?
CR0 is one of the control registers available on x86 CPUs, which contains flags controlling CPU features related to memory protection, multitasking, paging, etc. You can find a full description in Volume 3, Section 2.5 of Intel's Software Developer's Manual.
These registers are accessed by special instructions that the compiler doesn't normally generate, so read_cr0() is a function which executes the instruction to read this register (via inline assembly) and returns the result in a general-purpose register. Likewise, write_cr0() writes to this register.
The function calls are likely to be inlined, so that the generated code would be something like
mov eax, cr0
or eax, 0x10000
mov cr0, eax
The OR with 0x10000 sets bit 16, the Write Protect bit. On early 32-bit x86 CPUs, code running at supervisor level (like the kernel) was always allowed to write all of virtual memory, regardless of whether the page was marked read-only. This bit makes that optional, so that when it is set, such accesses will cause page faults. This line of code probably follows an earlier line which temporarily cleared the bit.
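As a hedged sketch of how this is typically used in a module: the helpers my_read_cr0/my_write_cr0 and patch_readonly below are made up for illustration; a real LKM would use the kernel's own read_cr0()/write_cr0(), and recent kernels deliberately prevent modules from clearing WP this way.

#define CR0_WP (1UL << 16)              /* Write Protect bit in CR0 */

static inline unsigned long my_read_cr0(void)
{
    unsigned long val;
    asm volatile("mov %%cr0, %0" : "=r"(val));
    return val;
}

static inline void my_write_cr0(unsigned long val)
{
    asm volatile("mov %0, %%cr0" : : "r"(val));
}

/* Clear WP, write one word into an otherwise read-only kernel page
   (e.g. a syscall table slot), then restore WP. */
static void patch_readonly(unsigned long *slot, unsigned long new_value)
{
    unsigned long cr0 = my_read_cr0();

    my_write_cr0(cr0 & ~CR0_WP);        /* supervisor writes now ignore R/O pages */
    *slot = new_value;
    my_write_cr0(cr0 | CR0_WP);         /* put write protection back */
}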

Detecting last mode of operation in an NMI handler

I am writing an NMI handler in an LKM. I would like to know the mode of operation (user or kernel) at the time the NMI fires. Is there a kernel flag that denotes this? I am running Linux 4.18.0.
You can determine whether the CPU was in user or kernel mode from the value of the CS register, which is saved on the stack by the CPU along with RIP, RSP, SS, etc.
The stack layout on interrupts is described in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Section 6.12.1.
In kernel mode the saved CS value is __KERNEL_CS; in user mode it is __USER_CS.
The default kernel NMI handler actually does this check in arch/x86/entry/entry_64.S:
ENTRY(nmi)
...
testb $3, CS-RIP+8(%rsp)
jz .Lnmi_from_kernel
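A hedged sketch of an LKM that does the same check from C: it assumes the register_nmi_handler()/NMI_LOCAL API and the user_mode() helper as they exist around the 4.x kernels mentioned in the question; the module name nmi_demo is made up, and the pt_regs field types should be checked against your tree.

#include <linux/module.h>
#include <linux/ptrace.h>
#include <asm/nmi.h>

static int nmi_demo_handler(unsigned int type, struct pt_regs *regs)
{
    /* user_mode() effectively tests the privilege level of the saved CS,
       the same thing the entry_64.S snippet above does with "testb $3". */
    if (user_mode(regs))
        pr_info("NMI interrupted user mode (CS=0x%lx)\n", regs->cs);
    else
        pr_info("NMI interrupted kernel mode (CS=0x%lx)\n", regs->cs);

    return NMI_DONE;        /* not "ours": let other handlers see this NMI too */
}

static int __init nmi_demo_init(void)
{
    return register_nmi_handler(NMI_LOCAL, nmi_demo_handler, 0, "nmi_demo");
}

static void __exit nmi_demo_exit(void)
{
    unregister_nmi_handler(NMI_LOCAL, "nmi_demo");
}

module_init(nmi_demo_init);
module_exit(nmi_demo_exit);
MODULE_LICENSE("GPL");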

Is atomic.LoadUint32 necessary?

Go's atomic package provides the function func LoadUint32(addr *uint32) (val uint32). I looked into the assembly implementation:
TEXT ·LoadUint32(SB),NOSPLIT,$0-12
MOVQ addr+0(FP), AX
MOVL 0(AX), AX
MOVL AX, val+8(FP)
RET
which basically loads the value from the memory address and returns it.
I'm wondering: if we have a uint32 pointer x, what is the difference between calling atomic.LoadUint32(x) and directly accessing it with *x?
which basically loads the value from the memory address and returns it.
That is the case in your context, but it might differ on another machine architecture where atomicity has to be implemented explicitly, as discussed here.
As mentioned in Go issue 8739:
We intrinsify both sync/atomic and runtime/internal/atomic for a bunch of architectures.
The APIs are not unified (e.g. LoadUint32 in sync/atomic is Load in runtime/internal/atomic).
(* "intrinsify" as in issue 4947)
As mentioned in my first link:
Regarding loads and stores.
Memory model along with instruction set specifies whether plain loads and stores are atomic or not. Typical guarantee for all modern commodity hardware is that aligned word-sized loads and stores are atomic. For example, on x86 architecture (IA-32 and Intel 64) 1-, 2-, 4-, 8- and 16-byte aligned loads and stores are all atomic (that is, plain MOV instruction, MOVQ and MOVDQA are atomic).

Page table in Linux kernel space during boot

I feel confused about page table management in the Linux kernel.
In Linux kernel space, before the page table is turned on, the kernel runs in virtual memory with a 1-1 mapping mechanism. After the page table is turned on, the kernel has to consult the page tables to translate a virtual address into a physical memory address.
My questions are:
1. After turning on the page table, is kernel space still 1 GB (from 0xC0000000 to 0xFFFFFFFF)?
2. In the page tables of a kernel process, are only the page table entries (PTEs) in the range 0xC0000000 - 0xFFFFFFFF mapped, with PTEs outside this range left unmapped because kernel code never jumps there?
3. Is the address mapping the same before and after turning on the page table? E.g. if, before turning on the page table, the virtual address 0xC00000FF is mapped to the physical address 0x000000FF, does that mapping stay unchanged afterwards, the only difference being that the CPU now has to consult the page table to translate a virtual address to a physical address, which was not needed before?
4. Is the page table for kernel space global, shared across all processes in the system including user processes?
5. Is this mechanism the same on 32-bit x86 and ARM?
The following discussion is based on 32-bit ARM Linux; the version of the kernel source code is 3.9.
All your questions can be addressed if you go through the procedure of setting up the initial page table (which is later overwritten by the function paging_init) and turning on the MMU.
When the kernel is first launched by the bootloader, the assembly function stext (in arch/arm/kernel/head.S) is the first function to run. Note that the MMU has not been turned on yet at this point.
Among other things, the important jobs done by stext are to:
create the initial page table (which is later overwritten by the function paging_init)
turn on the MMU
jump to the C part of the kernel initialization code and carry on
Before delving into your questions, it is beneficial to know:
Before the MMU is turned on, every address issued by the CPU is a physical address
After the MMU is turned on, every address issued by the CPU is a virtual address
A proper page table must be set up before turning on the MMU, otherwise your code will simply "be blown away"
By convention, the Linux kernel uses the upper 1 GB of the virtual address space and user land uses the lower 3 GB
Now the tricky part:
First trick: using position-independent code.
The assembly function stext is linked at the address "PAGE_OFFSET + TEXT_OFFSET" (0xCxxxxxxx), which is a virtual address. However, since the MMU has not been turned on yet, the actual address where stext runs is "PHYS_OFFSET + TEXT_OFFSET" (the actual value depends on your hardware), which is a physical address.
So here is the thing: the code of stext "thinks" it is running at an address like 0xCxxxxxxx, but it is actually running at (0x00000000 + some_offset) (say your hardware configures 0x00000000 as the start of RAM). So before the MMU is turned on, the assembly code has to be written very carefully to make sure that nothing goes wrong during execution. In fact, a technique called position-independent code (PIC) is used.
To further explain the above, I extract several assembly code snippets:
ldr r13, =__mmap_switched # address to jump to after MMU has been enabled
b __enable_mmu # jump to function "__enable_mmu" to turn on MMU
Note that the "ldr" instruction above is a pseudo-instruction meaning "get the (virtual) address of function __mmap_switched and put it into r13".
And function __enable_mmu in turn calls function __turn_mmu_on:
(Note that I removed several instructions from __turn_mmu_on that are essential to the function but not of interest here.)
ENTRY(__turn_mmu_on)
mcr p15, 0, r0, c1, c0, 0 # write control reg to enable MMU====> This is where MMU is turned on, after this instruction, every address issued by CPU is "virtual address" which will be translated by MMU
mov r3, r13 # r13 stores the (virtual) address to jump to after MMU has been enabled, which is (0xC0000000 + some_offset)
mov pc, r3 # a long jump
ENDPROC(__turn_mmu_on)
Second trick: identity mapping when setting up the initial page table before turning on the MMU.
More specifically, the address range in which the kernel code is running is mapped twice.
The first mapping, as expected, maps the virtual range 0xCxxxxxxx through (0xCxxxxxxx + offset) to the physical range 0x00000000 (again, this address depends on the hardware config) through (0x00000000 + offset).
The second mapping, interestingly, maps the range 0x00000000 through (0x00000000 + offset) to itself (i.e. virtual 0x00000000 --> physical 0x00000000), an identity mapping.
Why do that?
Remember that before the MMU is turned on, every address issued by the CPU is a physical address (starting at 0x00000000), and after the MMU is turned on, every address issued by the CPU is a virtual address (starting at 0xC0000000).
Because ARM has a pipelined structure, at the moment the MMU is turned on there are still instructions in the pipeline that use (physical) addresses generated before the MMU was turned on! To keep those instructions from blowing up, an identity mapping has to be set up to cater to them.
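A rough C sketch of the double mapping described above, using ARM short-descriptor 1 MB "section" entries. This is plain C to illustrate the idea, not the actual head.S code; PHYS_OFFSET, PAGE_OFFSET and SECTION_FLAGS below are illustrative values, and the real attribute bits depend on the CPU and kernel configuration.

#include <stdint.h>

#define PHYS_OFFSET   0x00000000u   /* where RAM starts -- board specific */
#define PAGE_OFFSET   0xC0000000u   /* kernel virtual base */
#define SECTION_SIZE  0x00100000u   /* each first-level entry covers 1 MB */
#define SECTION_FLAGS 0x00000C0Eu   /* AP + C/B bits + "section" type, illustrative */

/* 4096-entry first-level table (the 16 KB PGD mentioned in the answer below),
   one 32-bit entry per 1 MB of virtual address space. */
static uint32_t pgd[4096] __attribute__((aligned(16384)));

static void map_section(uint32_t virt, uint32_t phys)
{
    pgd[virt >> 20] = (phys & 0xFFF00000u) | SECTION_FLAGS;
}

static void create_initial_mappings(uint32_t kernel_size)
{
    uint32_t off;

    for (off = 0; off < kernel_size; off += SECTION_SIZE) {
        /* mapping #1: kernel virtual addresses -> physical RAM */
        map_section(PAGE_OFFSET + off, PHYS_OFFSET + off);
        /* mapping #2: identity mapping, so instructions already in the
           pipeline (still using physical addresses) survive the switch */
        map_section(PHYS_OFFSET + off, PHYS_OFFSET + off);
    }
}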
Now returning to your questions:
1. After turning on the page table, is kernel space still 1 GB (from 0xC0000000 to 0xFFFFFFFF)?
A: I guess you mean turning on the MMU. The answer is yes: kernel space is 1 GB (actually it also occupies several megabytes below 0xC0000000, but that is not of interest here).
2. In the page tables of a kernel process, are only the page table entries (PTEs) in the range 0xC0000000 - 0xFFFFFFFF mapped, with PTEs outside this range left unmapped because kernel code never jumps there?
A: The answer to this question is quite complicated, because it involves a lot of details about specific kernel configurations.
To fully answer it, you need to read the part of the kernel source code that sets up the initial page table (the assembly function __create_page_tables) and the function that sets up the final page table (the C function paging_init).
To put it simply, there are two levels of page table on ARM. The first-level table is the PGD, which occupies 16 KB. The kernel first zeroes out this PGD during initialization and then does the initial mapping in the assembly function __create_page_tables, where only a very small portion of the address space is mapped.
After that, the final page table is set up in paging_init, and in this function a much larger portion of the address space is mapped. Say you only have 512 MB of RAM: for most common configurations, those 512 MB would be mapped by kernel code section by section (one section is 1 MB). If your RAM is quite large (say 2 GB), only a portion of it will be directly mapped.
(I will stop here because there are too many details regarding Question 2)
3. Is the address mapping the same before and after turning on the page table?
A: I think I've already answered this in my explanation of the second trick: identity mapping when setting up the initial page table before turning on the MMU.
4. Is the page table for kernel space global, shared across all processes in the system including user processes?
A: Yes and no. Yes, because all processes share the same copy (content) of the kernel page table (the upper 1 GB part). No, because each process uses its own 16 KB of memory to store the kernel page table (although the content of the page table for the upper 1 GB part is identical in every process).
5. Is this mechanism the same on 32-bit x86 and ARM?
A: Different architectures use different mechanisms.
When Linux enables the MMU, it is only required that the virtual addresses of kernel space be mapped. This happens very early in booting. At this point, there is no user space. There is no restriction preventing the MMU from mapping multiple virtual addresses to the same physical address. So, when enabling the MMU, it is simplest to have a virt==phys mapping for the kernel code space in addition to the link==phys mapping (the 0xC0000000 mapping).
Is the address mapping the same before and after turning on the page table?
If the physical code address is 0xFF and the final link address is 0xC00000FF, then we have a duplicate mapping when turning on the MMU. Both 0xFF and 0xC00000FF map to the same physical page. A simple jmp (jump) or b (branch) will move from one address space to the other. At this point, the virt==phys mapping can be removed, as we are executing at the final destination address.
I think the above should answer points 1 through 3. Basically, the booting page tables are not the final page tables.
4. Is the page table for kernel space global, shared across all processes in the system including user processes?
Yes, this is a big win with a VIVT cache and for many other reasons.
5. Is this mechanism the same on 32-bit x86 and ARM?
Of course the underlying mechanics are different; they differ even between processors within these families (486 vs. P4 vs. AMD K6; ARM926 vs. Cortex-A5 vs. Cortex-A8, etc.). However, the semantics are very similar.
See: Bootmem (lwn.net) - an article on the early Linux memory phase.
Depending on the version, different memory pools and page table mappings are active during boot. The mappings we are all familiar with do not need to be in place until init runs.

Resources