I'm trying to use monitor/mwait instructions to monitor DMA writes from a device to a memory location. In a kernel module (char device) I have the following code (very similar to this piece of kernel code) that runs in a kernel thread:
static int do_monitor(void *arg)
{
struct page *p = arg; // p is a 'struct page *'; it's also remapped to user space
uint32_t *location_p = phys_to_virt(page_to_phys(p));
uint32_t prev = 0;
int i = 0;
while (i++ < 20) // to avoid infinite loop
{
if (*location_p == prev)
{
__monitor(location_p, 0, 0);
if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
clflush(location_p);
if (*location_p == prev)
__mwait(0, 0);
}
prev = *location_p;
printk(KERN_NOTICE "%d", prev);
}
}
In user space I have the following test code:
int fd = open("/dev/mon_test_dev", O_RDWR);
unsigned char *mapped = (unsigned char *)mmap(0, mmap_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
for (int i = 1; i <= 5; ++i)
*mapped = i;
munmap(mapped, mmap_size);
close(fd);
And the kernel log looks like this:
1
2
3
4
5
5
5
5
5
5
5 5 5 5 5 5 5 5 5 5
I.e. it seems that mwait doesn't wait at all.
What could be the reason?
The definition of MONITOR/MWAIT semantics does not specify explicitly whether DMA transactions may or may not trigger it. It is supposed that triggering happens for logical processor's stores.
Current descriptions of MONITOR and MWAIT in the Intel's official Software Developer Manual are quite vague to that respect. However, there are two clauses in the MONITOR section that caught my attention:
The content of EAX is an effective address (in 64-bit mode, RAX is used). By default, the DS segment is used to create a linear address that is monitored.
The address range must use memory of the write-back type. Only write-back memory will correctly trigger the
monitoring hardware.
The first clause states that MONITOR is meant to be used with linear addresses, not physical ones. Devices and their DMA are meant to work with physical addresses only. So basically this means that all agents relying on the same MONITOR range should operate in the same domain of virtual memory space.
The second clause requires the monitored memory region to be cacheable (write-back, WB). For DMA, respective memory range is usually has to be marked as uncacheable, or write-combining at best (UC or WC). This is even a stronger indicator that your intent to use MONITOR/MWAIT to be triggered by DMA is very unlikely to work on current hardware.
Considering your high-level goal - to be able to tell when a device has written to given memory range - I cannot remember any robust method to achieve it, besides using virtualization for devices (VTd, IOMMU etc.) Basically, the classic approach for a peripheral device is to issue an interrupt when it is done with writing to memory. Until an interrupt arrives, there is no way for CPU to tell if all DMA bytes have successfully reached their destination in memory.
Device virtualization allows to abstract physical addresses from a device in a transparent manner, and have an equivalent of a page fault when it attempts to write/read from memory.
Related
I want to read certain performance counters. I know that there are tools like perf, that can do it for me in the user space itself, I want the code to be inside the Linux kernel.
I want to write a mechanism to monitor performance counters on Intel(R) Core(TM) i7-3770 CPU. On top of using I am using Ubuntu kernel 4.19.2. I have gotten the following method from easyperf
Here's part of my code to read instructions.
struct perf_event_attr *attr
memset (&pe, 0, sizeof (struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof (struct perf_event_attr);
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
pe.disabled = 0;
pe.exclude_kernel = 0;
pe.exclude_user = 0;
pe.exclude_hv = 0;
pe.exclude_idle = 0;
fd = syscall(__NR_perf_event_open, hw, pid, cpu, grp, flags);
uint64_t perf_read(int fd) {
uint64_t val;
int rc;
rc = read(fd, &val, sizeof(val));
assert(rc == sizeof(val));
return val;
}
I want to put the same lines in the kernel code (in the context switch function) and check the values being read.
My end goal is to figure out a way to read performance counters for a process, every time it switches to another, from the kernel(4.19.2) itself.
To achieve this I check out the code for the system call number __NR_perf_event_open. It can be found here
To make to usable I copied the code inside as a separate function, named it perf_event_open() in the same file and exported.
Now the problem is whenever I call perf_event_open() in the same way as above, the descriptor returned is -2. Checking with the error codes, I figured out that the error was ENOENT. In the perf_event_open() man page, the cause of this error is defined as wrong type field.
Since file descriptors are associated to the process that's opened them, how can one use them from the kernel? Is there an alternative way to configure the pmu to start counting without involving file descriptors?
You probably don't want the overhead of reprogramming a counter inside the context-switch function.
The easiest thing would be to make system calls from user-space to program the PMU (to count some event, probably setting it to count in kernel mode but not user-space, just so the counter overflows less often).
Then just use rdpmc twice (to get start/stop counts) in your custom kernel code. The counter will stay running, and I guess the kernel perf code will handle interrupts when it wraps around. (Or when its PEBS buffer is full.)
IDK if it's possible to program a counter so it just wraps without interrupting, for use-cases like this where you don't care about totals or sample-based profiling, and just want to use rdpmc. If so, do that.
Old answer, addressing your old question which was based on a buggy printf format string that was printing non-zero garbage even though you weren't counting anything in user-space either.
Your inline asm looks correct, so the question is what exactly that PMU counter is programmed to count in kernel mode in the context where your code runs.
perf virtualizes the PMU counters on context-switch, giving the illusion of perf stat counting a single process even when it migrates across CPUs. Unless you're using perf -a to get system-wide counts, the PMU might not be programmed to count anything, so multiple reads would all give 0 even if at other times it's programmed to count a fast-changing event like cycles or instructions.
Are you sure you have perf set to count user + kernel events, not just user-space events?
perf stat will show something like instructions:u instead of instructions if it's limiting itself to user-space. (This is the default for non-root if you haven't lowered sysctl kernel.perf_event_paranoid to 0 or something from the safe default that doesn't let user-space learn anything about the kernel.)
There's HW support for programming a counter to only count when CPL != 0 (i.e. not in ring 0 / kernel mode). Higher values for kernel.perf_event_paranoid restrict the perf API to not allow programming counters to count in kernel+user mode, but even with paranoid = -1 it's possible to program them this way. If that's how you programmed a counter, then that would explain everything.
We need to see your code that programs the counters. That doesn't happen automatically.
The kernel doesn't just leave the counters running all the time when no process has used a PAPI function to enable a per-process or system-wide counter; that would generate interrupts that slow the system down for no benefit.
I was wondering if there is an existing system call/API for accessing getting the physical address of the virtual address?
If there is none then some direction on how to get that working ?
Also, how to get the physical address of MMIO which is non-pageable physical memory ?
The answer lies in IOMemoryDescriptor and IODMACommand objects.
If the memory in question is kernel-allocated, it should be allocated by creating an IOBufferMemoryDescriptor in the first place. If that's not possible, or if it's a buffer allocated in user space, you can wrap the relevant pointer using IOMemoryDescriptor::withAddressRange(address, length, options, task) or one of the other factory functions. In the case of withAddressRange, the address passed in must be virtual, in the address space of task.
You can directly grab physical address ranges from an IOMemoryDescriptor by calling the getPhysicalSegment() function (only valid between prepare()…complete() calls). However, normally you would do this for creating scatter-gather lists (DMA), and for this purpose Apple strongly recommends the IODMACommand. You can create these using IODMACommand::withSpecification(). Then use the genIOVMSegments() function to generate the scatter-gather list.
Modern Macs, and also some old PPC G5s contain an IOMMU (Intel calls this VT-d), so the system memory addresses you pass to PCI/Thunderbolt devices are not in fact physical, but IO-Mapped. IODMACommand will do this for you, as long as you use the "system mapper" (the default) and set mappingOptions to kMapped. If you're preparing addresses for the CPU, not a device, you will want to turn off mapping - use kIOMemoryMapperNone in your IOMemoryDescriptor options. Depending on what exactly you're trying to do, you probably don't need IODMACommand in this case either.
Note: it's often wise to pool and reuse your IODMACommand objects, rather than freeing and reallocating them.
Regarding MMIO, I assume you mean PCI BARs and similar - for IOPCIDevice, you can grab an IOMemoryDescriptor representing the memory-mapped device range using getDeviceMemoryWithRegister() and similar functions.
Example:
If all you want are pure CPU-space physical addresses for a given virtual memory range in some task, you can do something like this (untested as a complete kext that uses it would be rather large):
// INPUTS:
mach_vm_address_t virtual_range_start = …; // start address of virtual memory
mach_vm_size_t virtual_range_size_bytes = …; // number of bytes in range
task_t task = …; // Task object of process in which the virtual memory address is mapped
IOOptionBits direction = kIODirectionInOut; // whether the memory will be written or read, or both during the operation
IOOptionBits options =
kIOMemoryMapperNone // we want raw physical addresses, not IO-mapped
| direction;
// Process for getting physical addresses:
IOMemoryDescriptor* md = IOMemoryDescriptor::withAddressRange(
virtual_range_start, virtual_range_size_bytes, direction, task);
// TODO: check for md == nullptr
// Wire down virtual range to specific physical pages
IOReturn result = md->prepare(direction);
// TODO: do error handling
IOByteCount offset = 0;
while (offset < virtual_range_size_bytes)
{
IOByteCount segment_len = 0;
addr64_t phys_addr = md->getPhysicalSegment(offset, &len, kIOMemoryMapperNone);
// TODO: do something with physical range of segment_len bytes at address phys_addr here
offset += segment_len;
}
/* Unwire. Call this only once you're done with the physical ranges
* as the pager can change the physical-virtual mapping outside of
* prepare…complete blocks. */
md->complete(direction);
md->release();
As explained above, this is not suitable for generating DMA scatter-gather lists for device I/O. Note also this code is only valid for 64-bit kernels. You'll need to be careful if you still need to support ancient 32-bit kernels (OS X 10.7 and earlier) because virtual and physical addresses can still be 64-bit (64-bit user processes and PAE, respectively), but not all memory descriptor functions are set up for that. There are 64-bit-safe variants available to be used for 32-bit kexts.
Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?
You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(filp)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);
Answer to your question is a simple No, or Yes with a complex modification to kernel source.
First, to enable swapping out, you have to ask yourself what is happening when kswapd is swapping out. Essentially it will walk through all the processes and make a decision whether its memory can be swapped out or not. And all these memory have the hardware mode of ring 3. So SMAP essentially forbid it from being read as data or executed as program in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
And check your distros "CONFIG_X86_SMAP", for mine Ubuntu it is default to "y" which is the case for past few years.
But if you keep your memory as a kernel address (ring 0), then you may need to consider changing the kswapd operation to trigger swapout of kernel addresses. Whick kernel addresses to walk first? And what if the address is part of the kswapd's kernel operation? The complexities involved is huge.
And next is to consider the swap in operation: When the memory read is attempted and it's "not present" bit is enabled, then hardware exception will trigger linux kernel memory fault handler (which is __do_page_fault()).
And looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and there after how it handler the kernel addresses (do_kern_address_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
which essentially is just reporting as error for possible scenario. If you want to enable kernel address pagefaulting, then this path has to be modified.
And note too that the SMAP check (inside smap_violation) is done in the user address pagefaulting (do_usr_addr_fault()).
On Linux machine, trying to write driver and trying to map some kernel memory to application for performance gains.
Checking driver implementation for mmap online, finding different varieties of implementation.
As per man pages, mmap - creates new mapping in virtual address space of calling process.
1) Who allocates physical address space during mmap calling? kernel or device driver?
seen following varieties of driver mmap implementation.
a) driver creates contiguous physical kernel memory and maps it with process address space.
static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
unsigned long size = vma->vm_end - vma->vm_start;
pos = kmalloc(size); //allocate contiguous physical memory.
while (size > 0) {
unsigned long pfn;
pfn = virt_to_phys((void *) pos) >> PAGE_SHIFT; // Get Page frame number
if (remap_pfn_range(vma, start, pfn, PAGE_SIZE, PAGE_SHARED)) // creates mapping
return -EAGAIN;
start += PAGE_SIZE;
pos += PAGE_SIZE;
size -= PAGE_SIZE;
}
}
b) driver creates virtual kernel memory and maps it with process address space.
static struct vm_operations_struct dr_vm_ops = {
.open = dr_vma_open,
.close = dr_vma_close,
};
static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
unsigned long size = vma->vm_end - vma->vm_start;
void *kp = vmalloc(size);
unsigned long up;
for (up = vma->vm_start; up < vma->vm_end; up += PAGE_SIZE) {
struct page *page = vmalloc_to_page(kp); //Finding physical page from virtual address
err = vm_insert_page(vma, up, page); //How is it different from remap_pfn_range?
if (err)
break;
kp += PAGE_SIZE;
}
vma->vm_ops = &dr_vm_ops;
ps_vma_open(vma);
}
c) not sure who allocates memory in this case.
static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
vma->vm_end - vma->vm_start,
vma->vm_page_prot)) // creates mapping
return -EAGAIN;
}
2) If kernel allocates memory for mmap, is n't memory wasted in a&b cases?
3) remap_pfn_range is to map multiple pages where as vm_insert_page is just for single page mapping. is it the only difference b/w these two APIs?
Thank You,
Gopinath.
Which you use depends on what you're trying to accomplish.
(1) A device driver is part of the kernel so it doesn't really make sense to differentiate that way. For these cases, the device driver is asking for memory to be allocated for its own use from the (physical) memory resources available to the entire kernel.
With (a), a physically contiguous space is being allocated. You might do this if there is some piece of external hardware (a PCI device, for example) that will be reading or writing that memory. The return value from kmalloc already has a mapping to kernel virtual address space. remap_pfn_range is being used to map the page into the user virtual address space of the current process as well.
For (b), a virtually contiguous space is being allocated. If there is no external hardware involved, this is what you would typically use. There is still physical memory being allocated to your driver, but it isn't guaranteed that the pages are physically contiguous -- hence fewer constraints on which pages can be allocated. (They will still be contiguous in kernel virtual address space.) And then you are simply using a different API to implement the same kind of mapping into user virtual address space.
For (c), the memory being mapped is allocated under control of some other subsystem. Thevm_pgoff field has already been set to the base physical address of the resource. For example, the memory might correspond to a PCI device's address region (a network interface controller's registers, say) where that physical address is determined/assigned by your BIOS (or whatever mechanism your machine uses).
(2) Not sure I understand this question. How can the memory be "wasted" if it's being used by the device driver and a cooperating user process? And if the kernel needs to read and write the memory, there must be kernel virtual address space allocated and it needs to be mapped to the underlying physical memory. Likewise, if the user space process is to access the memory, there must be user virtual address space allocated and that must be mapped to the physical memory as well.
"Allocating virtual address space" essentially just means allocating page table entries for the memory. That is done separately from actually allocating the physical memory. And it's done separately for kernel space and user space. And "mapping" means setting the page table entry (the virtual address of the beginning of the page) to point to the correct physical page address.
(3) Yes. They are different APIs that accomplish much the same thing. Sometimes you have a struct page, sometimes you have a pfn. It can be confusing: there are often several ways to accomplish the same thing. Developers typically use the one most obvious for the item they already have ("I already have a struct page. I could calculate its pfn. But why do that when there's this other API that accepts a struct page?").
I have existing code that takes a list of struct page * and builds a descriptor table to share memory with a device. The upper layer of that code currently expects a buffer allocated with vmalloc or from user space, and uses vmalloc_to_page to obtain the corresponding struct page *.
Now the upper layer needs to cope with all kinds of memory, not just memory obtained through vmalloc. This could be a buffer obtained with kmalloc, a pointer inside the stack of a kernel thread, or other cases that I'm not aware of. The only guarantee I have is that the caller of this upper layer must ensure that the memory buffer in question is mapped in kernel space at that point (i.e. it is valid to access buffer[i] for all 0<=i<size at this point). How do I obtain a struct page* corresponding to an arbitrary pointer?
Putting it in pseudo-code, I have this:
lower_layer(struct page*);
upper_layer(void *buffer, size_t size) {
for (addr = buffer & PAGE_MASK; addr <= buffer + size; addr += PAGE_SIZE) {
struct page *pg = vmalloc_to_page(addr);
lower_layer(pg);
}
}
and I now need to change upper_layer to cope with any valid buffer (without changing lower_layer).
I've found virt_to_page, which Linux Device Drivers indicates operates on “a logical address, [not] memory from vmalloc or high memory”. Furthermore, is_vmalloc_addr tests whether an address comes from vmalloc, and virt_addr_valid tests if an address is a valid virtual address (fodder for virt_to_page; this includes kmalloc(GFP_KERNEL) and kernel stacks). What about other cases: global buffers, high memory (it'll come one day, though I can ignore it for now), possibly other kinds that I'm not aware of? So I could reformulate my question as:
What are all the kinds of memory zones in the kernel?
How do I tell them apart?
How do I obtain page mapping information for each of them?
If it matters, the code is running on ARM (with an MMU), and the kernel version is at least 2.6.26.
I guess what you want is a page table walk, something like (warning, not actual code, locking missing etc):
struct mm_struct *mm = current->mm;
pgd = pgd_offset(mm, address);
pmd = pmd_offset(pgd, address);
pte = *pte_offset_map(pmd, address);
page = pte_page(pte);
But you you should be very very careful with this. the kmalloc address you got might very well be not page aligned for example. This sounds like a very dangerous API to me.
Mapping Addresses to a struct page
There is a requirement for Linux to have a fast method of mapping virtual addresses to physical addresses and for mapping struct pages to their physical address. Linux achieves this by knowing where, in both virtual and physical memory, the global mem_map array is because the global array has pointers to all struct pages representing physical memory in the system. All architectures achieve this with very similar mechanisms, but, for illustration purposes, we will only examine the x86 carefully.
Mapping Physical to Virtual Kernel Addresses
any virtual address can be translated to the physical address by simply subtracting PAGE_OFFSET, which is essentially what the function virt_to_phys() with the macro __pa() does:
/* from <asm-i386/page.h> */
132 #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
/* from <asm-i386/io.h> */
76 static inline unsigned long virt_to_phys(volatile void * address)
77 {
78 return __pa(address);
79 }
Obviously, the reverse operation involves simply adding PAGE_OFFSET, which is carried out by the function phys_to_virt() with the macro __va(). Next we see how this helps the mapping of struct pages to physical addresses.
There is one exception where virt_to_phys() cannot be used to convert virtual addresses to physical ones. Specifically, on the PPC and ARM architectures, virt_to_phys() cannot be used to convert addresses that have been returned by the function consistent_alloc(). consistent_alloc() is used on PPC and ARM architectures to return memory from non-cached for use with DMA.
What are all the kinds of memory zones in the kernel? <---see here
For user-space allocated memory, you want to use get_user_pages, which will give you the list of pages associated with the malloc'd memory, and also increment their reference counter (you'll need to call page_cache_release on each page once done with them.)
For vmalloc'd pages, vmalloc_to_page is your friend, and I don't think you need to do anything.
For 64 bit architectures, the answer of gby should be adapted to:
pgd_t * pgd;
pmd_t * pmd;
pte_t * pte;
struct page *page = NULL;
pud_t * pud;
void * kernel_address;
pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
pte = pte_offset_map(pmd, address);
page = pte_page(*pte);
// mapping in kernel memory:
kernel_address = kmap(page);
// work with kernel_address....
kunmap(page);
You could try virt_to_page. I am not sure it is what you want, but at least it is somewhere to start looking.