Coherent DMA memory on ARM

Coherent DMA memory on ARM - caching

I'm new to ARM/Linux and some things aren't clear to me (I might be completely off on this).
I'm trying to allocate coherent memory for my device driver (i.e., a region that is non-cached or write-through).
So I attempt to do that with dma_alloc_coherent in Linux.
When I inspect the page table attributes, I notice that I get the "Shareable device" memory type.
There are a few memory types with different cache policies, as described in the link below:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363e/Cacgehgd.html
I was expecting to get non-cacheable or write-through memory. What is the cache policy of the "Shareable Device" type, and how does it differ from the explicit non-cacheable and write-through memory types?

Actually, depending on the ARM architecture release, it is possible for cached memory regions to remain coherent after DMA transfers. There is an extension to the AMBA spec (the AXI Coherency Extensions, ACE) that keeps caches coherent after another master has performed a transfer. In other words, after another core or a DMA engine performs a transfer, your cache will hold the updated values (or at least the affected lines are marked invalid).
This means that if the Linux kernel is aware of your ARM architecture release, it relies on the coherency mechanism to update the caches, and thus the pages are marked as shareable.
Please see issue D of the ACE Protocol Specification on the ARM site (registration required) for more information.

Related

AArch64 memory synchronization operations on multiply-mapped addresses

Suppose I have two pages that map to the same physical memory. Would an acquire operation (or fence) on a virtual address in one page properly synchronize with a release operation (or fence) on a virtual address in the other? Secondly, would cache maintenance operations (dc, ic), too, work with such multiply-mapped memory?
In other words...
...would a stlr (or dmb ishst if fence) on one core to one page properly synchronize with ldar (or dmb ishld if fence) on another core to the other page?
...would a dc whatever on one virtual address have the same effect as a dc whatever on the other?

As to memory ordering, yes, this is fine. The ARMv8 memory model is defined in terms of reads and writes of a Location, which is defined as "a byte that is associated with an address in the physical address space". See B2.3.1 in the Architecture Reference Manual, version H.a. (Older versions left out the "physical" part so it seems someone noticed that this was ambiguous.)
Likewise, an exclusive load ldxr says in the manual that it marks the physical address as an exclusive access.
Note that if this weren't the case, then on typical OSes, shared memory between processes (e.g. shmget, mmap(MAP_SHARED), etc) would be unusable, as the shared mappings are normally at different virtual addresses in the different processes.
I can't answer the part about cache right now.
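As a rough userspace illustration of the point about shared mappings, the sketch below maps one file page at two different virtual addresses, publishes a value with a release store through the first mapping in a child process, and reads it with an acquire load through the second mapping in the parent. The struct layout and the function name alias_release_acquire are made up for this example. On AArch64, compilers typically lower these C11 operations to stlr and ldar (or similar), but that is an assumption about code generation, not something the example checks.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout of the shared page, for illustration only. */
struct shared {
    atomic_int ready;   /* flag published with release semantics */
    int payload;        /* plain data guarded by the flag */
};

/*
 * Map the same file page twice, write through mapping `a` in a child
 * process and read through mapping `b` in the parent.
 * Returns the payload seen by the parent, or -1 on setup failure.
 */
int alias_release_acquire(int value)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;
    int fd = fileno(f);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, page) != 0) {
        fclose(f);
        return -1;
    }

    /* Two distinct virtual addresses backed by the same physical page. */
    struct shared *a = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    struct shared *b = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    atomic_init(&a->ready, 0);

    pid_t pid = fork();
    if (pid < 0) {
        fclose(f);
        return -1;
    }
    if (pid == 0) {
        /* Writer: store the data, then release-store the flag via `a`. */
        a->payload = value;
        atomic_store_explicit(&a->ready, 1, memory_order_release);
        _exit(0);
    }

    /* Reader: acquire-load the flag via `b`, then read the data. */
    while (atomic_load_explicit(&b->ready, memory_order_acquire) == 0)
        ;
    int seen = b->payload;

    waitpid(pid, NULL, 0);
    munmap(a, (size_t)page);
    munmap(b, (size_t)page);
    fclose(f);
    return seen;
}
```

Calling alias_release_acquire(42) should return 42 if the release and acquire synchronize across the two mappings, which is what the memory model's definition in terms of physical locations implies.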

Mmap a writecombine region to userspace documentation

According to the following documentation
https://www.kernel.org/doc/html/latest/x86/pat.html,
Drivers wanting to export some pages to userspace do it by using mmap interface and a combination of:
pgprot_noncached()
io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
Note that this set of APIs only works with IO (non RAM) regions. If driver wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc() as step 0 above and also track the usage of those pages and use set_memory_wb() before the page is freed to free pool.
Why is the extra step set_memory_uc() or set_memory_wc() needed for RAM regions?

This is needed because set_memory_uc() and set_memory_wc() are specifically written to work with RAM regions; the other API functions you're being told to use here are intended for I/O regions.
Since you want to export page(s) in a RAM region using the API functions listed, your driver needs to mark them as uncached or write-combined first so that they can essentially be treated like I/O pages, and then use the APIs. RAM pages are also mapped write-back in the kernel's own address space, so without this step the same physical page would be mapped with conflicting cache attributes. Note that set_memory_wb() is not a cache writeback: it restores the pages' default write-back attribute, which must happen before your driver considers itself "finished" with them and returns them to the free pool.

ensure the DMA-capable memory

I was reading section 'Part Id' of the following document. I'm not sure how relevant this document is to kernel 2.6.35, for instance; specifically, it says:
..the DMA address of the memory must be within the dma_mask of the device..
and it recommends passing certain flags, such as GFP_DMA, to kmalloc to ensure that the memory falls within the DMA mask provided.
However, if the memory is allocated from a cache pool created by kmem_cache_create, using kmem_cache_alloc(.. GFP_ATOMIC), does this fail to meet the requirements outlined in DMA-API.txt?
On the other hand, LDD talks about the __GFP_DMA flag with regard to legacy ISA devices, so I'm not sure this is applicable to PCI/PCIe devices.
This is an x86 64-bit platform, if it matters:
pci_set_dma_mask(dev, 0xffffffffffffffffULL);
pci_set_consistent_dma_mask(dev, 0xffffffffffffffffULL);
I would appreciate some explanation of this.

For GFP_* for DMA
On x86:
ISA - when using kmalloc() you need to bitwise-OR GFP_DMA with GFP_KERNEL (or GFP_ATOMIC) because of the following:
GFP_DMA guarantees:
(1) physical addresses are consecutive when get_free_page returns more than one page and
(2) only addresses lower than MAX_DMA_ADDRESS are returned. MAX_DMA_ADDRESS is 16MB on the PC because of ISA constraints
PCI - you don't need to use GFP_DMA because there is no MAX_DMA_ADDRESS limit
The device's dma_mask is checked when calling dma_map_* or dma_alloc_coherent.
dma_alloc_coherent ensures that the allocated memory can be used with dma_map_*, which gives other benefits too. (The implementation may choose to ignore flags that affect the location of the returned memory, like GFP_DMA.)
You can refer to http://coweb.cc.gatech.edu/sysHackfest/uploads/58/DMA_howto.1.txt
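To make the "within the dma_mask" requirement concrete, here is a minimal userspace sketch of the arithmetic involved. The macro mirrors the shape of the kernel's DMA_BIT_MASK(n), but it is redefined locally here, and fits_dma_mask is a hypothetical helper written for this example, not a kernel function. The real check is done inside the DMA API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the usual "low n bits set" mask, as in the kernel macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Hypothetical helper: true if every byte of [addr, addr + size) can be
 * addressed by a device whose DMA mask is `mask`.
 */
bool fits_dma_mask(uint64_t addr, size_t size, uint64_t mask)
{
    if (size == 0)
        return false;               /* nothing to transfer */

    uint64_t last = addr + (uint64_t)size - 1;
    if (last < addr)
        return false;               /* range wrapped around */

    return last <= mask;            /* highest byte must be reachable */
}
```

With a 24-bit (ISA-style) mask, a buffer ending at 16MB - 1 fits, while one that crosses the 16MB boundary does not; with the 64-bit mask from the question, any non-wrapping range fits, which is why GFP_DMA adds nothing in that case.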

Unknown symbol flush_cache_range in linux device driver

I am just writing my very first Linux device driver, and I have run into a problem. I want to prevent one memory region from being cached, so I have been trying to use flush_cache_range() and flush_tlb_range() to flush the cache for this memory region. Everything compiles fine, but when I try to load the kernel module I get the following errors:
Unknown symbol flush_cache_range (err 0)
Unknown symbol flush_tlb_range (err 0)
I find this very strange. Shouldn't they be defined in the kernel?
I know that alternatively I could use dma_alloc_coherent() to allocate a non-cached memory region. But I don't have a device structure, and although passing NULL for this parameter didn't cause any errors, I also couldn't see any of the data that was supposed to be there.
Some information about my system: I'm trying to get this running on an ARM microcontroller with an integrated FPGA (the Xilinx Zynq). The FPGA copies some data to a memory location specified by the CPU. Now I want to access this memory without getting old data from the caches.
Any help is much appreciated.

You cannot use functions such as flush_cache_range() because they are not intended to be used by modules.
To allocate memory that can be accessed by a DMA device, you must use dma_alloc_coherent().
This requires a valid device structure so that it can do proper mapping between memory addresses and bus addresses.
If your device is not on a bus that is handled by an existing framework (such as PCI), you have to create a platform device.

A few notes:
1- flush_cache_range doesn't "prevent one memory region from being cached". It simply flushes (cleans + invalidates) the caches. Any future writes/reads to this memory region through the same virtual range will go through the cache again.
2- If the FPGA is writing to memory and the CPU is then going to read from this memory, flushing the cache probably isn't the correct thing to do anyway. Usually what you need to do is invalidate the memory region and then tell the FPGA to write.
3- Please take a look at "${kernel-src}/Documentation/DMA-API.txt" in the kernel sources. It has plenty of information about how you can safely (cache maintenance + phys_to_dma translation) use a specific region of memory for DMA.

How does the management of page table entries (PTEs) differ between kernel space and user space?

In Linux, after the page table is enabled, does the kernel map the PTEs that belong to kernel space only once and never remap them again? Is this the opposite of PTEs in user space, which need to be remapped every time a process switch happens?
So, I want to know the difference in management of PTEs in kernel and user space.
This question is an extension of the question at:
Page table in Linux kernel space during boot

Each process has its own page tables (although the parts that describe the kernel's address space are the same and are shared.)
On a process switch, the CPU is told the address of the new table (this is a single pointer which is written to the CR3 register on x86 CPUs).

So, I want to know the difference in management of PTEs in kernel and user space.
See these related questions,
Does Linux use self map for page tables?
Linux Virtual memory
Kernel developer on memory management
Position independent code and shared libraries
There are many optimizations to this,
Each task has a different PGD, but PTE values may be shared between processes, so large chunks of memory can be mapped the same way for each process; only the top-level directory pointer (CR3 on x86, TTB on ARM) is updated.
Also, many CPUs have a TLB and cache. These need to be maintained along with the memory mapping. Caches may be VIVT, VIPT or PIPT. The first two have to have some cache flushing iff the PGD and/or PTE change. Often a CPU will support a process, thread or domain ID. The OS then only needs to switch this register during a context switch. The hardware cache and TLB entries must contain tags with the process, thread, or domain ID. This is an implementation detail of each architecture.
So it is possible that TLB flushes are needed when the top-level page table register changes. The CPU could flush the entire TLB when this happens. However, this would penalize pages that remain mapped.
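The idea of tagging TLB entries with a process or domain ID can be sketched with a toy software model. Everything here (the struct names, the 8-entry size, the round-robin replacement) is invented for illustration and does not correspond to any particular CPU. It only shows why a context switch can change one register instead of flushing every entry.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 8   /* arbitrary size for the toy model */

struct tlb_entry {
    bool valid;
    uint16_t asid;      /* address-space (process) tag */
    uint32_t vpn;       /* virtual page number */
    uint32_t pfn;       /* physical frame number */
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
    uint16_t current_asid;  /* models the CPU's "current ID" register */
    unsigned next;          /* round-robin replacement index */
};

/* Context switch: only the current ID changes, entries stay in place. */
void tlb_switch_asid(struct tlb *t, uint16_t asid)
{
    t->current_asid = asid;
}

/* Fill one slot with a translation tagged by the current ID. */
void tlb_insert(struct tlb *t, uint32_t vpn, uint32_t pfn)
{
    struct tlb_entry *s = &t->e[t->next];
    t->next = (t->next + 1) % TLB_ENTRIES;
    s->valid = true;
    s->asid = t->current_asid;
    s->vpn = vpn;
    s->pfn = pfn;
}

/* A hit requires both the page number and the tag to match. */
bool tlb_lookup(const struct tlb *t, uint32_t vpn, uint32_t *pfn)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        const struct tlb_entry *s = &t->e[i];
        if (s->valid && s->asid == t->current_asid && s->vpn == vpn) {
            *pfn = s->pfn;
            return true;
        }
    }
    return false;
}
```

In this model, two processes can both cache a translation for the same virtual page without conflict, and switching back to the first process finds its entry still present, which is the benefit over flushing the whole TLB on every switch.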
Also, sub-sections of memory can be the same. A loader or other library can use mmap to create code that is similar between processes. This common code may not need to be swapped at the page table level, depending on architecture, loader and Linux version. It could of course have a virtual alias, and then it needs to be swapped.
And the final point of the answer: kernel pages are always mapped. Only a non-preemptive OS could avoid mapping the kernel, but that would make little sense as every process wants to call the kernel. I guess the micro-kernel paradigm allows device drivers to unload when they are not in use. Linux uses module loading to handle this.
