Mmap a writecombine region to userspace documentation - linux-kernel

According to the following documentation
https://www.kernel.org/doc/html/latest/x86/pat.html,
Drivers wanting to export some pages to userspace do it by using mmap interface and a combination of:
pgprot_noncached()
io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
Note that this set of APIs only works with IO (non RAM) regions. If driver wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc() as step 0 above and also track the usage of those pages and use set_memory_wb() before the page is freed to free pool.
Why is the extra step set_memory_uc() or set_memory_wc() needed for RAM regions?

This is needed since set_memory_uc() and set_memory_wc() are specifically written to work with memory regions; the other API functions you're being told to use here are for I/O regions.
Since you want to work with page(s) in a RAM region using the API functions listed, your driver needs to mark them as uncached or write-combined first so that they can essentially be treated like I/O pages. It can then use those APIs, and it must restore the pages to write-back with set_memory_wb() before they are returned to the free pool, so the allocator never hands out pages carrying non-standard cache attributes.
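For illustration, here is a minimal sketch of an mmap handler doing this for a single page of RAM, assuming the page was allocated with __get_free_page(); the names drv_buf, drv_mmap and drv_release_buf are made up for the example, and set_memory_wc()/set_memory_wb() are the helpers the PAT document refers to:

```c
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/set_memory.h>

static unsigned long drv_buf;   /* one page of RAM, from __get_free_page() */

static int drv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long pfn = virt_to_phys((void *)drv_buf) >> PAGE_SHIFT;
    int ret;

    /* Step 0 for RAM pages: switch the kernel mapping to write-combine
     * so all mappings of the page agree on the memory type. */
    ret = set_memory_wc(drv_buf, 1);
    if (ret)
        return ret;

    /* Export the page to userspace as write-combined. */
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
    return remap_pfn_range(vma, vma->vm_start, pfn, PAGE_SIZE,
                           vma->vm_page_prot);
}

static void drv_release_buf(void)
{
    /* Restore write-back before giving the page back to the allocator. */
    set_memory_wb(drv_buf, 1);
    free_page(drv_buf);
}
```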

Related

Can I call dma_map_single() on DeviceB using an address returned from dma_alloc_coherent on DeviceA?

I am writing a custom Linux driver that needs to DMA memory between multiple PCIe devices. I have the following situation:
I'm using dma_alloc_coherent to allocate memory for DeviceA
I then use DeviceA to fill the memory buffer.
Everything is fine so far, but at this point I would like to DMA the memory to DeviceB and I'm not sure of the proper way to do it.
For now I am calling dma_map_single for DeviceB using the address returned from dma_alloc_coherent called on DeviceA. This seems to work fine on x86_64, but it feels like I'm breaking the rules because:
dma_map_single is supposed to be called with memory allocated from kmalloc ("and friends"). Is it a problem to call it with an address returned from another device's dma_alloc_coherent call?
If #1 is "ok", then I'm still not sure whether it is necessary to call the dma_sync_* functions that are needed for dma_map_single memory. Since the memory was originally allocated with dma_alloc_coherent, it should be uncached memory, so I believe the answer is "dma_sync_* calls are not necessary", but I am not sure.
I'm worried that I'm just getting lucky that this works and that a future kernel update will break it, since it is unclear whether I'm following the API rules correctly.
My code will eventually have to run on ARM and PPC too, so I need to make sure I'm doing things in a platform-independent manner instead of getting by with some x86_64 architecture hack.
I'm using this as a reference:
https://www.kernel.org/doc/html/latest/core-api/dma-api.html
dma_alloc_coherent() acts similarly to __get_free_pages(), but it takes a size in bytes rather than a page order, so I would guess there is no issue here.
First, call dma_mapping_error() after dma_map_single() to catch any platform-specific mapping failure. The dma_sync_*() helpers are used with streaming DMA mappings to keep the device and the CPU in sync. At a minimum, dma_sync_single_for_cpu() is required, since a buffer the device has modified must be synced before the CPU uses it.
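As a sketch of the pattern the question describes, with the mapping-error check and a sync call added; devA and devB are placeholders for the two devices' struct device pointers, and the error handling is abbreviated:

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

#define BUF_SIZE 4096   /* illustrative size */

static int transfer_between_devices(struct device *devA, struct device *devB)
{
    void *cpu_addr;
    dma_addr_t dma_a, dma_b;

    /* Coherent (consistent) buffer allocated against DeviceA. */
    cpu_addr = dma_alloc_coherent(devA, BUF_SIZE, &dma_a, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;

    /* ... program DeviceA to fill the buffer at dma_a ... */

    /* Streaming mapping of the same CPU address for DeviceB. */
    dma_b = dma_map_single(devB, cpu_addr, BUF_SIZE, DMA_BIDIRECTIONAL);
    if (dma_mapping_error(devB, dma_b)) {
        dma_free_coherent(devA, BUF_SIZE, cpu_addr, dma_a);
        return -EIO;
    }

    /* ... program DeviceB with dma_b ... */

    /* If DeviceB modified the buffer, hand ownership back to the CPU
     * before the CPU reads it. */
    dma_sync_single_for_cpu(devB, dma_b, BUF_SIZE, DMA_BIDIRECTIONAL);

    dma_unmap_single(devB, dma_b, BUF_SIZE, DMA_BIDIRECTIONAL);
    dma_free_coherent(devA, BUF_SIZE, cpu_addr, dma_a);
    return 0;
}
```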

Coherent DMA memory on ARM

I'm new to ARM/Linux and there are things that aren't clear to me. (I might be completely off on this.)
I'm trying to get coherent memory allocated for my device driver (i.e., a region that is non-cached or write-through).
So I attempt to do that with dma_alloc_coherent in Linux.
When I inspect the page table attributes, I notice that I get "Shareable device" memory type.
There are a few memory types regarding the cache policy as in the link below:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363e/Cacgehgd.html
I was expecting that I would get non-cacheable or write-through memory. What is the cache policy of the "Shareable Device" type, and how does it differ from the explicit non-cacheable and write-through memory types?
Actually, depending on the ARM architecture release, it is possible that cached memory regions remain coherent after DMA transfers. There is an extension to the AMBA specification (AXI Coherency Extensions, ACE) that keeps cache memories coherent after another master has performed a transfer; in other words, after another core or a DMA engine performs a transfer, your cache will hold the updated values (or at least the affected lines are marked as invalid).
This means that if the Linux kernel is aware of your ARM architecture release, it trusts the hardware coherency mechanism to update the caches, and thus the pages are marked as shareable.
Please see issue D of the ACE Protocol Specification on the ARM site (registration required) for more information.
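For reference, the driver-side allocation looks the same either way; whether the returned buffer's kernel mapping ends up cacheable (relying on hardware coherency such as ACE) or non-cacheable/Device-type is a per-platform decision made by the kernel. A minimal sketch, with dev and size as placeholders:

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Sketch: allocate a coherent buffer; the kernel chooses the mapping
 * attributes depending on whether the platform is DMA-coherent. */
static void *coherent_buf;
static dma_addr_t coherent_handle;

static int alloc_coherent_buf(struct device *dev, size_t size)
{
    coherent_buf = dma_alloc_coherent(dev, size, &coherent_handle,
                                      GFP_KERNEL);
    return coherent_buf ? 0 : -ENOMEM;
}
```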

Does mlock prevent the page from appearing in a core dump?

I have a process with some sensitive memory which must never be written to disk.
I also have a requirement that I need core dumps to satisfy first-time data capture requirements of my client.
Does locking a page using mlock() prevent the page from appearing in a core dump?
Note, this is an embedded system and we don't actually have any swap space.
Taken from man 2 madvise:
The madvise() system call advises the kernel about how to handle
paging input/output in the address range beginning at address addr and
with size length bytes. It allows an application to tell the kernel
how it expects to use some mapped or shared memory areas, so that the
kernel can choose appropriate read-ahead and caching techniques.
This call does not influence the semantics of the application (except
in the case of MADV_DONTNEED), but may influence its performance. The
kernel is free to ignore the advice.
Particularly check the option MADV_DONTDUMP :
Exclude from a core dump those pages in the range specified by addr
and length. This is useful in applications that have large areas of
memory that are known not to be useful in a core dump. The effect of
MADV_DONTDUMP takes precedence over the bit mask that is set via the
/proc/PID/coredump_filter file (see core(5)).
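Note that mlock() by itself only keeps the pages resident (out of swap); it is madvise(MADV_DONTDUMP) that excludes them from core dumps. A small userspace sketch combining the two, with protect_secret() as an illustrative name (madvise() wants a page-aligned address):

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate a page-aligned buffer that is locked in RAM and excluded
 * from core dumps. Returns 0 on success, -1 on failure. */
int protect_secret(void **out, size_t len)
{
    void *buf;

    if (posix_memalign(&buf, (size_t)sysconf(_SC_PAGESIZE), len))
        return -1;

    if (mlock(buf, len) != 0)               /* keep it out of swap */
        goto fail;
    if (madvise(buf, len, MADV_DONTDUMP))   /* keep it out of core dumps */
        goto fail;

    *out = buf;
    return 0;

fail:
    free(buf);
    return -1;
}
```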

How does the management of page table entries (PTEs) differ between kernel space and user space?

In Linux, after the page tables are enabled, does the kernel map the PTEs belonging to kernel space only once and never remap them again? Is this the opposite of PTEs in user space, which need to be remapped every time a process switch happens?
So, I want to know the difference in how PTEs are managed in kernel space and user space.
This question extends the question at:
Page table in Linux kernel space during boot
Each process has its own page tables (although the parts that describe the kernel's address space are the same and are shared.)
On a process switch, the CPU is told the address of the new table (this is a single pointer which is written to the CR3 register on x86 CPUs).
So, I want to know the difference in how PTEs are managed in kernel space and user space.
See these related questions,
Does Linux use self map for page tables?
Linux Virtual memory
Kernel developer on memory management
Position independent code and shared libraries
There are many optimizations to this:
Each task has a different PGD, but PTE values may be shared between processes, so large chunks of memory can be mapped the same way for each process; only the top-level directory (CR3 on x86, TTB on ARM) is updated.
Also, many CPUs have a TLB and caches, and these need to be maintained alongside the memory mapping. Caches can be VIVT, VIPT or PIPT; the first two require some cache flushing if the PGD and/or PTEs change. Often a CPU will support a process, thread or domain ID, and the OS only needs to switch this register during a context switch; the hardware cache and TLB entries then carry tags with the process, thread or domain ID. This is an implementation detail of each architecture.
So it is possible that TLB flushes are needed when the top-level page table register changes. The CPU could flush the entire TLB when this happens, but that would penalize pages that remain mapped.
Also, sub-sections of memory can be the same. A loader or other library can use mmap to create code mappings that are shared between processes. This common code may not need to be switched at the page table level, depending on the architecture, the loader and the Linux version. It could of course live at a different virtual alias, and then it does need to be switched.
And the final point of the answer: kernel pages are always mapped. Only a non-preemptive OS could avoid mapping the kernel, but that would make little sense, as every process wants to call into the kernel. I guess the micro-kernel paradigm allows device drivers to be unloaded when they are not in use; Linux uses module loading to handle this.
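As a conceptual illustration (not actual kernel code) of why kernel PTEs are set up once: every per-process top-level table shares the kernel half, so a context switch only has to load a new top-level pointer and the kernel mappings come along for free. The constants below are only indicative of x86-64:

```c
/* Conceptual sketch, not kernel code. */
#define PTRS_PER_PGD     512    /* x86-64: 512 top-level entries     */
#define KERNEL_PGD_START 256    /* upper half belongs to the kernel  */

typedef unsigned long pgd_t;

/* When a new process address space is created, its PGD gets an empty
 * user half but simply shares the kernel half with the reference
 * (init) page tables, so kernel mappings never need to be rebuilt. */
static void clone_pgd(pgd_t *new_pgd, const pgd_t *init_pgd)
{
    int i;

    for (i = 0; i < KERNEL_PGD_START; i++)
        new_pgd[i] = 0;                 /* empty user half   */
    for (i = KERNEL_PGD_START; i < PTRS_PER_PGD; i++)
        new_pgd[i] = init_pgd[i];       /* shared kernel half */
}
```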

How to allocate non-cacheable physical memory in the kernel?

If I want to allocate non-cacheable physical memory (DRAM) for use in a driver (i.e., I don't want the data to be cached in the CPU's data cache when it is accessed), how can I do this?
There are functions like kmalloc(), get_free_pages(), vmalloc(), etc.,
but it seems that I can't specify whether the data can be cached or not using these functions.
Any suggestion on how to do it?
Thanks!
In short, there is no easy way to do this; it is very platform dependent.
If you want to have a go at it, read drivers/char/mem.c and Chapter 15 of Linux Device Drivers, 3rd Edition.
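As one concrete (and deliberately x86-specific) illustration of that platform dependence: with PAT, a driver can allocate ordinary RAM and then flip its kernel mapping to uncached with set_memory_uc(), restoring write-back before freeing. A hedged sketch with made-up names; other architectures need different mechanisms (or dma_alloc_coherent() if the buffer is for a device):

```c
#include <linux/gfp.h>
#include <linux/set_memory.h>   /* set_memory_uc()/set_memory_wb() */

static unsigned long uc_buf;

static int alloc_uncached_page(void)
{
    uc_buf = __get_free_page(GFP_KERNEL);
    if (!uc_buf)
        return -ENOMEM;
    if (set_memory_uc(uc_buf, 1)) {     /* 1 page, now uncached */
        free_page(uc_buf);
        return -EIO;
    }
    return 0;
}

static void free_uncached_page(void)
{
    set_memory_wb(uc_buf, 1);           /* restore write-back */
    free_page(uc_buf);
}
```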

Resources