Modifying current process' pte through /dev/mem? - linux-kernel

AFAIK, /dev/mem presents physical memory to user, and it's usually being used for device read/write through MMIO. In my use case, I want to modify current process' pte so that two ptes will point to the same physical page. In particular, I move a x86_64 binary above 4G virtual space and mmap virtual space below 4G. I want to make 4G above pte and 4G below pte point to the same physical page, so that when I write into 4G above vaddr and read from 4G below pte, I get the same result. Sample code might look like below:
*(unsigned char *)vaddr1 = 7 // write into 4G above vaddr1
val = *(unsigned char *)vaddr2; // read from 4G below vaddr2
printf("val should be 7, %d\n", val);
But after I modify 4G below pte to point to physical page pointed by 4G above pte through /dev/mem, kernel give me message below,
BUG: Bad page map in process mmap pte:8000000007eb2067 pmd:07acb067
page:ffffea00001fac80 count:0 mapcount:-1 mapping: (null) index:0x101b7b
page flags: 0x4000000000000014(referenced|dirty)
addr:0000000101b7b000 vm_flags:00100073 anon_vma:ffff880007ab0708 mapping: (null) index:101b7b
Pid: 609, comm: mmap Tainted: G B 3.5.3 #7
Call Trace:
[<ffffffff8107abcc>] ? print_bad_pte+0x1d2/0x1ea
[<ffffffff8107bf18>] ? unmap_single_vma+0x3a0/0x56d
[<ffffffff8107c745>] ? unmap_vmas+0x2c/0x46
[<ffffffff8108106b>] ? exit_mmap+0x6e/0xdd
[<ffffffff8101cc4f>] ? do_page_fault+0x30f/0x348
[<ffffffff81020ce6>] ? mmput+0x20/0xb4
[<ffffffff810256ae>] ? exit_mm+0x105/0x110
[<ffffffff8103bb6c>] ? hrtimer_try_to_cancel+0x67/0x70
[<ffffffff81026b59>] ? do_exit+0x211/0x711
[<ffffffff810272e0>] ? do_group_exit+0x76/0xa0
[<ffffffff8102731c>] ? sys_exit_group+0x12/0x19
[<ffffffff812f3662>] ? system_call_fastpath+0x16/0x1b
BUG: Bad rss-counter state mm:ffff880007a496c0 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff880007a496c0 idx:1 val:1
I guess kernel will examine if the pte has been modified, and I did something wrong. Here are vaddr1 and vaddr2's pte before/after my pte rewriting.
above 4G pte: 0x8000000007eb2067
below 4G pte: 0x0000000007ea7067
after rewriting pte...
above 4G pte: 0x8000000007eb2067
below 4G pte: 0x8000000007eb2067
Any idea? Thanks.
Note: Now I know I should release the physical page pointed by vaddr2's pte, otherwise kernel will note that physical page isn't pointed by any pte and give those error. But how? I try to use __free_page, but get error below.
BUG: unable to handle kernel paging request at ffffebe00008001c
IP: [<ffffffff8106b908>] __free_pages+0x4/0x2a
PGD 0
Oops: 0000 [#2] PREEMPT SMP
CPU 0

Related

Windbg vtop outputs physical address larger than memory

I'm studying the internals of the Windows kernel and one of the things I'm looking into is how paging and virtual addresses work in Windows. I was experimenting with windbg's !vtop function when I noticed something strange I was getting an impossible physical address?
For example here is my output of a !process 0 0 command:
PROCESS fffffa8005319b30
SessionId: none Cid: 0104 Peb: 7fffffd8000 ParentCid: 0004
DirBase: a8df3000 ObjectTable: fffff8a0002f6df0 HandleCount: 29.
Image: smss.exe
when I run !vtop a8df3000 fffffa8005319b30. I get the following result:
lkd> !vtop a8df3000 fffffa8005319b30
Amd64VtoP: Virt fffffa80`05319b30, pagedir a8df3000
Amd64VtoP: PML4E a8df3fa8
Amd64VtoP: PDPE 2e54000
Amd64VtoP: PDE 2e55148
Amd64VtoP: Large page mapped phys 1`3eb19b30
Virtual address fffffa8001f07310 translates to physical address 13eb19b30
The problem I have with this is that my VM that I'm running this test on only has 4GB and 13eb19b30 is 5,346,794,288...
When I run !dd 13eb19b30 and dd fffffa8001f07310 I get the same result so windows seems to be able to access this physical address somehow... Does anyone know how this is done?
I found this post on Cheat Engine that looks like he had a similar problem to me. But they found no solution in that case either
I see You have posted this is RESE also i saw it there didn't understand exactly what you are trying to do.
i see a few discrepancies
you seemed to have used a PFN a8df3000 but it seems windbg seems to be using a PFN of 187000 instead
btw pfn iirc should be dirbase & 0xfffff000
also for virtual address you seem to using the EPROCESS address of your process
are you sure that this is the right virtual address you want to use ?
also it seems you are using lkd which is local kernel debugging prompt
and i hope you understand that lkd is not real kernel debugging
So I think I was finally able to come with a reasonable answer to the problem. It turns out that vmware doesn't seem to actually expose to the VM contiguous memory but instead segments it into different memory "runs". I was able to confirm this by using the volatility:
$ python vol.py -f ~/Desktop/Win7SP1x64-d8737a34.vmss vmwareinfo --verbose | less
Magic: 0xbad1bad1 (Version 1)
Group count: 0x5c
File Offset PhysMem Offset Size
----------- -------------- ----------
0x000010000 0x000000000000 0xc0000000
0x0c0010000 0x000100000000 0xc0000000
Here is a volatility github wiki article that goes into more detail about: volatility

How can my PCI device driver remap PCI memory to userspace?

I am trying to implement a PCI device driver for a virtual PCI device on QEMU. The device defines a BAR region as RAM, and the driver can do ioremap() this region and access it without any issues. The next step is to assign this region (or a fraction of it) to a user application.
To do this, I have also implemented an .mmap function as part of my driver file operations. This mmap is simply using remap_pfn_range, but it also passes the pfn of the memory pointer returned by the ioremap() earlier.
However, upon running the user space application, the mmap is successful, but when the app tries to access the memory, it gets killed and I get the following dmesg errors.
"
a.out: Corrupted page table at address 7f66248b8000
..Some page table info..
Bad pagetable: 000f [#2] SMP NOPTI
..and the core dump..
"
Does anyone know what have I done wrong? Did I missed a step? Or it could be an error specific to QEMU?
I am running x86_softmmu as my QEMU configuration and my kernel is the 4.14
I've solved this issue and managed to map PCI memory to user space via the driver. As #IanAbbott implied, I've changed the pfn input of the remap_pfn_range() function I was using in my custom ->mmap().
The original was:
io_remap_pfn_range(vma, vma->vm_start, pfn, vma->vm_end - vma->vm_start, vma->vm_page_prot));
where the pfn was the result of the buffer pointer return from the ioremap(). I changed the pfn to:
pfn = pci_resource_start(pdev, BAR) >> PAGE_SHIFT;
That basically points to the actual starting address pointed by the BAR. My working remap_pfn_range() function is now:
io_remap_pfn_range(vma, vma->vm_start, pci_resource_start(pdev, BAR) >> PAGE_SHIFT, vma->vm_end - vma->vm_start,vma->vm_page_prot);
I confirmed that it works by doing some dummy writes to the buffer pointer in my driver, then picking up the reads and doing some writes in my user space application.

Kernel panic using deferred_io on kmalloced buffer

I'm writing a framebuffer for an SPI LCD display on ARM. Before I complete that, I've written a memory only driver and trialled it under Ubuntu (Intel, Virtualbox). The driver works fine - I've allocated a block of memory using kmalloc, page aligned it (it's page aligned anyway actually), and used the framebuffer system to create a /dev/fb1. I have my own mmap function if that's relevant (deferred_io ignores it and uses its own by the look of it).
I have set:
info->screen_base = (u8 __iomem *)kmemptr;
info->fix.smem_len = kmem_size;
When I open /dev/fb1 with a test program and mmap it, it works correctly. I can see what is happening x11vnc to "share" the fb1 out:
x11vnc -rawfb map:/dev/fb1#320x240x16
And view with a vnc viewer:
gvncviewer strontium:0
I've made sure I've no overflows by writing to the entire mmapped buffer and that seems to be fine.
The problem arises when I add in deferred_io. As a test of it, I have a delay of 1 second and the called deferred_io function does nothing except a pr_devel() print. I followed the docs.
Now, the test program opens /dev/fb1 fine, mmap returns ok but as soon as I write to that pointer, I get a kernel panic. The following dump is from the ARM machine actually but it panics on the Ubuntu VM as well:
root#duovero:~/testdrv# ./fbtest1 /dev/fb1
Device opened: /dev/fb3
Screen is: 320 x 240, 16 bpp
Screen size = 153600 bytes
mmap on device succeeded
Unable to handle kernel paging request at virtual address bf81e020
pgd = edbec000
[bf81e020] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in: hhlcd28a(O) sysimgblt sysfillrect syscopyarea fb_sys_fops bnep ipv6 mwifiex_sdio mwifiex btmrvl_sdio firmware_class btmrvl cfg80211 bluetooth rfkill
CPU: 0 Tainted: G O (3.6.0-hh04 #1)
PC is at fb_deferred_io_fault+0x34/0xb0
LR is at fb_deferred_io_fault+0x2c/0xb0
pc : [<c0271b7c>] lr : [<c0271b74>] psr: a0000113
sp : edbdfdb8 ip : 00000000 fp : edbeedb8
r10: edbeedb8 r9 : 00000029 r8 : edbeedb8
r7 : 00000029 r6 : bf81e020 r5 : eda99128 r4 : edbdfdd8
r3 : c081e000 r2 : f0000000 r1 : 00001000 r0 : bf81e020
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5387d Table: adbec04a DAC: 00000015
Process fbtest1 (pid: 485, stack limit = 0xedbde2f8)
Stack: (0xedbdfdb8 to 0xedbe0000)
[snipped out hexdump]
[<c0271b7c>] (fb_deferred_io_fault+0x34/0xb0) from [<c00db0c4>] (__do_fault+0xbc/0x470)
[<c00db0c4>] (__do_fault+0xbc/0x470) from [<c00dde0c>] (handle_pte_fault+0x2c4/0x790)
[<c00dde0c>] (handle_pte_fault+0x2c4/0x790) from [<c00de398>] (handle_mm_fault+0xc0/0xd4)
[<c00de398>] (handle_mm_fault+0xc0/0xd4) from [<c049a038>] (do_page_fault+0x140/0x37c)
[<c049a038>] (do_page_fault+0x140/0x37c) from [<c0008348>] (do_DataAbort+0x34/0x98)
[<c0008348>] (do_DataAbort+0x34/0x98) from [<c0498af4>] (__dabt_usr+0x34/0x40)
Exception stack(0xedbdffb0 to 0xedbdfff8)
ffa0: 00000280 0000ffff b6f5c900 00000000
ffc0: 00000003 00000000 00025800 b6f5c900 bea6dc1c 00011048 00000032 b6f5b000
ffe0: 00006450 bea6db70 00000000 000085d6 40000030 ffffffff
Code: 28bd8070 ebffff37 e2506000 0a00001b (e5963000)
---[ end trace 7e5ca57bebd433f5 ]---
Segmentation fault
root#duovero:~/testdrv#
I'm totally stumped - other drivers look more or less the same as mine but I assume they work. Most use vmalloc actually - is there a difference between kmalloc and vmalloc for this purpose?
Confirmed the fix so I'll answer my own question:
deferred_io changes the info mmap to its own that sets up fault handlers for writes to the video memory pages. In the fault handler it
checks bounds against info->fix.smem_len, so you must set that
gets the page that was written to.
For the latter case, it treats vmalloc differently from kmalloc (by checking info->screen_base to see if it's vmalloced). If you have vmalloced, it uses screen_base as the virtual address. If you have not used vmalloc, it assumes that the address of interest is the physical address in info->fix.smem_start.
So, to use deferred_io correctly
set screen_base (char __iomem *) and point that to the virtual address.
set info->fix.smem_len to the video buffer size
if you are not using vmalloc, you must set info->fix.smem_start to the video buffer's physical address by using virt_to_phys(vid_buffer);
Confirmed on Ubuntu as fixing the issue.
Really interesting, I'm currently implementing SPI-based display FB driver too (Sharp Memory LCD display and my VFDHack32 host driver). I also facing similar problem where it crashes at deferred_io. Can you share you source code ? mine is at my GitHub repo. P.S. that Memory LCD display is monochrome so I just pretend to be color display and just check whether the pixel byte is empty (dot off) or not empty (dot on).

How to access physical address on Linux in C?

Just for my study, I'm trying to implement routine to translate from Virtual Address to Physical Address on ARM Linux.
I'm looking at the specification from the material below.
http://homepages.wmich.edu/~grantner/ece6050/ARM7100vA_3.pdf
I'm implementing the routine in Kernel Module, and I could read TTBR0 for the first step.
But I realized I have to access physical address P using TTBR0 and target virtual address.
I didn't know how to access physical address P, and tried ioremap() to get virtual
address V for that physical address P.
However, when I call ioremap(), it doesn't work and dump backtrace as follows.
Does anyone know how the right way to access physical address?
[255075.566078] ------------[ cut here ]------------
[255075.570816] WARNING: at arch/arm/mm/ioremap.c:208 __arm_ioremap_pfn_caller+)
[255075.578806] Modules linked in: tv_driver(+) mtk_stp_gps mtk_fm_priv(P) mt66]
[255075.595096] Backtrace:
[255075.597590] [<c004df38>] (dump_backtrace+0x0/0x10c) from [<c057c8c4>] (dump)
[255075.606186] r7:00000009 r6:000000d0 r5:c005546c r4:00000000
[255075.611800] [<c057c8ac>] (dump_stack+0x0/0x1c) from [<c00817e0>] (warn_slow)
[255075.621378] [<c008178c>] (warn_slowpath_common+0x0/0x6c) from [<c008181c>] )
[255075.630390] r9:ca466000 r8:00000000 r7:4a25b000 r6:00000001 r5:00000004
[255075.636883] r4:00000290
[255075.639616] [<c00817f8>] (warn_slowpath_null+0x0/0x2c) from [<c005546c>] (_)
[255075.649740] [<c005537c>] (__arm_ioremap_pfn_caller+0x0/0x104) from [<c00554)
[255075.660186] [<c00554a0>] (__arm_ioremap_caller+0x0/0x64) from [<c0055518>] )
[255075.669115] r5:00000000 r4:4a25b290
[255075.672730] [<c0055504>] (__arm_ioremap+0x0/0x18) from [<bf144794>] (cl)
[255075.682695] [<bf144720>] (client_init+0x0/0x288 [driver]) from [<c00)
[255075.692774] r7:00022451 r6:00000000 r5:c07834c0 r4:4018a008
[255075.698458] [<c0043628>] (do_one_initcall+0x0/0x1a4) from [<c00bf150>] (sys)
[255075.707624] [<c00bf0c4>] (sys_init_module+0x0/0x1a4) from [<c0049b80>] (ret)
[255075.716673] r7:00000080 r6:40043d6c r5:00022451 r4:4018a008
[255075.722272] ---[ end trace bffd4c38629759a2 ]---
You cannot access directly to physical address while the MMU is active. But you have some ways to do it:
-Write an assembly code to disable MMU ( by writing in CP15 register ), so you will be able to write in any physical adress still in assembly.
-Find the virtual offset, with this offset you will be able to find the virtual address V for the physical address P.
-Use ioremap() function ( in your case i don't know why it doesn't work, i need more information :) )
-Use phys_to_virt function describe in arch/arm/include/asm/memory.h, but it's only for direct map ram memory.

Page table in Linux kernel space during boot

I feel confuse in page table management in Linux kernel ?
In Linux kernel space, before page table is turned on. Kernel will run in virtual memory with 1-1 mapping mechanism. After page table is turned on, then kernel has consult page tables to translate a virtual address into a physical memory address.
Questions are:
At this time, after turning on page table, kernel space is still 1GB (from 0xC0000000 - 0xFFFFFFFF ) ?
And in the page tables of kernel process, only page table entries (PTE) in range from 0xC0000000 - 0xFFFFFFFF are mapped ?. PTEs are out of this range will be not mapped because kernel code never jump there ?
Mapping address before and after turning on page table is same ?
Eg. before turning on page table, the virtual address 0xC00000FF is mapped to physical address 0x000000FF, then after turning on page table, above mapping does not change. virtual address 0xC00000FF is still mapped to physical address 0x000000FF. Different thing is only that after turning on page table, CPU has consult the page table to translate virtual address to physical address which no need to do before.
The page table in kernel space is global and will be shared across all process in the system including user process ?
This mechanism is same in x86 32bit and ARM ?
The following discussion is based on 32-bit ARM Linux, and version of kernel source code is 3.9
All your questions can be addressed if you go through the procedure of setting up the initial page table(which will be overwitten later by function paging_init ) and turning on MMU.
When kernel is first launched by bootloader, Assembly function stext(in arch\arm\kernel\head.s) is the first function to run. Note that MMU has not been turned on yet at this moment.
Among other things, the two import jobs done by this function stext is:
create the initial page tabel(which will be overwitten later by
function paging_init )
turn on MMU
jump to C part of kernel initialization code and carry on
Before delving into the your questions, it is benificial to know:
Before MMU is turned on, every address issued by CPU is physical
address
After MMU is turned on, every address issued by CPU is virtual address
A proper page table should be set up before turning on MMU, otherwise your code will simply "be blown away"
By convention, Linux kernel uses higher 1GB part of virtual address and user land uses the lower 3GB part
Now the tricky part:
First trick: using position-independent code.
Assembly function stext is linked to address "PAGE_OFFSET + TEXT_OFFSET"(0xCxxxxxxx), which is a virtual address, however, since MMU has not been turned on yet, the actual address where assembly function stext is running is "PHYS_OFFSET + TEXT_OFFSET"(the actual value depends on your actual hardware), which is a physical address.
So, here is the thing: the program of function stext "thinks" that it is running in address like 0xCxxxxxxx but it is actually running in address (0x00000000 + some_offeset)(say your hardware configures 0x00000000 as the starting point of RAM). So before turning on MMU, the assembly code need to be very carefully written to make sure that nothing goes wrong during the execution procedure. In fact a techinque called position-independent code(PIC) is used.
To further explain the above, I extract several assembly code snippets:
ldr r13, =__mmap_switched # address to jump to after MMU has been enabled
b __enable_mmu # jump to function "__enable_mmu" to turn on MMU
Note that the above "ldr" instruction is a pseudo instruction which means "get the (virtual) address of function __mmap_switched and put it into r13"
And function __enable_mmu in turn calls function __turn_mmu_on:
(Note that I removed several instructions from function __turn_mmu_on which are essential instructions to the function but not of our interest)
ENTRY(__turn_mmu_on)
mcr p15, 0, r0, c1, c0, 0 # write control reg to enable MMU====> This is where MMU is turned on, after this instruction, every address issued by CPU is "virtual address" which will be translated by MMU
mov r3, r13 # r13 stores the (virtual) address to jump to after MMU has been enabled, which is (0xC0000000 + some_offset)
mov pc, r3 # a long jump
ENDPROC(__turn_mmu_on)
Second trick: identical mapping when setting up initial page table before turning on MMU.
More specifically, the same address range where kernel code is running is mapped twice.
The first mapping, as expected, maps address range 0x00000000(again,
this address depends on hardware config) through (0x00000000 +
offset) to 0xCxxxxxxx through (0xCxxxxxxx + offset)
The second mapping, interestingly, maps address range 0x00000000
through (0x00000000 + offset) to itself(i.e.: 0x00000000 -->
(0x00000000 + offset))
Why doing that?
Remember that before MMU is turned on, every address issued by CPU is physical address(starting at 0x00000000) and after MMU is turned on, every address issued by CPU is virtual address(starting at 0xC0000000).
Because ARM is a pipeline structure, at the moment MMU is turned on, there are still instructions in ARM's pipeine that are using (physical) addresses that are generated by CPU before MMU is turned on! To avoid these instructions to get blown up, an identical mapping has to be set up to cater them.
Now returning to your questions:
At this time, after turning on page table, kernel space is still 1GB (from 0xC0000000 - 0xFFFFFFFF ) ?
A: I guess you mean turning on MMU. The answer is yes, kernel space is 1GB(actually it also occupies several mega bytes below 0xC0000000, but that is not of our interest)
And in the page tables of kernel process, only page table entries (PTE) in range from 0xC0000000 - 0xFFFFFFFF are mapped ?. PTEs are out
of this range will be not mapped because kernel code never jump there
?
A: While the answer to this question is quite complicated because it involves lot of details regarding specific kernel configurations.
To fully answer this question, you need to read the part of kernel source code that set up the initial page table(assembly function __create_page_tables) and the function which sets up the final page table(C function paging_init).
To put it simple, there are two levels of page table in ARM, the first page table is PGD, which occupies 16KB. Kernel first zeros out this PGD during initialization process and does the initial mapping in assembly function __create_page_tables. In function __create_page_tables, only a very small portion of address space is mapped.
After that, the final page table is set up in function paging_init, and in this function, a quite large portion of address space is mapped. Say if you only have 512M RAM, for most common configurations, this 512M-RAM would be mapping by kernel code section by section(1 section is 1MB). If your RAM is quite large(say 2GB), only a portion of your RAM will be directly mapped.
(I will stop here because there are too many details regarding Question 2)
Mapping address before and after turning on page table is same ?
A: I think I've already answered this question in my explanation of "Second trick: identical mapping when setting up initial page table before turning on MMU."
4 . The page table in kernel space is global and will be shared across
all process in the system including user process ?
A: Yes and no. Yes because all processes share the same copy(content) of kernel page table(higher 1GB part). No because each process uses its own 16KB memory to store the kernel page table(although the content of page table for higher 1GB part is identical for every process).
5 . This mechanism is same in x86 32bit and ARM ?
Different Architectures use different mechanism
When Linux enables the MMU, it is only required that the virtual address of the kernel space is mapped. This happens very early in booting. At this point, there is no user space. There is no restrictions that the MMU can map multiple virtual addresses to the same physical address. So, when enabling the MMU, it is simplest to have a virt==phys mapping for the kernel code space and the mapping link==phys or the 0xC0000000 mapping.
Mapping address before and after turning on page table is same ?
If the physical code address is Oxff and the final link address is 0xc00000FF, then we have a duplicate mapping when turning on the MMU. Both 0xff and 0xc00000ff map to the same physical page. A simple jmp (jump) or b (branch) will move from one address space to the other. At this point, the virt==phys mapping can be removed as we are executing at the final destination address.
I think the above should answer points 1 through 3. Basically, the booting page tables are not the final page tables.
4 . The page table in kernel space is global and will be shared across all process in the system including user process?
Yes, this is a big win with a VIVT cache and for many other reasons.
5 . This mechanism is same in x86 32bit and ARM?
Of course the underlying mechanics are different. They are different even for different processors within these families; 486 vs P4 vs Amd-K6; ARM926 vs Cortex-A5 vs Cortex-A8, etc. However, the semantics are very similar.
See: Bootmem#lwn.net - An article on the early Linux memory phase.
Depending on the version, different memory pools and page table mappings are active during boot. The mappings we are all familiar with do not need to be in place until init runs.

Resources