dma_alloc_coherent fails on x86_64 but works on i686 - linux-kernel

I have a driver for a PCI device which uses the CMA allocation mechanism for DMA allocations. It works fine on kernel 3.18 in 32-bit mode, but when I try to use it with a 64-bit kernel (same config as in 32-bit, just with 64-bit mode switched on), the DMA allocation fails.
The only thing I see in dmesg is:
fallback device: swiotlb buffer is full (sz: 8388608 bytes)
I use the kernel command line:
swiotlb=16384 iommu=soft cma=256M
and I am allocating 8 MB.
The function call is:
new_region->kaddr = dma_alloc_coherent(NULL, size, &new_region->paddr, GFP_KERNEL | GFP_DMA32);
Can someone explain this behaviour in 64-bit mode?

After more investigation, I think you may have the same root cause as I did; I'll list it here for your reference.
The CMA allocation starts from the highest memory blocks, so if you have more than 3 GB of memory, the last physical memory block will be above 0xFFFFFFFF, which means the CMA base address is above 4 GB. But dma_alloc_coherent() requires the address to be below the mask [(0x1 << 32) - 1] = 0xFFFFFFFF, so if the end address of the allocated DMA region is above 0xFFFFFFFF, the allocation falls back to the swiotlb buffer and you see the error you described. See the memory map below: the last 2 GB are above the 4 GB boundary.
To resolve this problem, we can specify the CMA start address, size, and limit to control where the CMA area is reserved.
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000779c4fff] usable
BIOS-e820: [mem 0x00000000779c5000-0x0000000077a45fff] reserved
BIOS-e820: [mem 0x0000000077a46000-0x0000000079426fff] usable
BIOS-e820: [mem 0x0000000079427000-0x000000007b32efff] reserved
BIOS-e820: [mem 0x000000007b32f000-0x000000007b985fff] ACPI NVS
BIOS-e820: [mem 0x000000007b986000-0x000000007bad3fff] ACPI data
BIOS-e820: [mem 0x000000007bad4000-0x000000007bafffff] usable
BIOS-e820: [mem 0x000000007bb00000-0x000000008fffffff] reserved
BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000017fffffff] usable
cma=nn[MG]@[start[MG][-end[MG]]]
[ARM,X86,KNL]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
include/linux/dma-contiguous.h
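As a hedged example of that syntax (the size matches the cma=256M from the question; the start/end values are just an illustration and assume your kernel supports the @start-end placement form documented above):
cma=256M@0-4G
would force the 256 MB CMA area to be reserved below the 4 GB boundary, so the allocation no longer has to fall back to swiotlb.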
To give you a workaround first: please add "mem=3072M" to your kernel command line; that will make it work in your case. The reason you see this may be related to the NULL passed to dma_alloc_coherent(): it will use x86_dma_fallback_dev by default. I will update this once I have more information.
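For reference, here is a hedged sketch of the non-workaround fix hinted at above: pass the real struct device instead of NULL and declare a 32-bit coherent DMA mask, so the DMA layer knows the device's addressing limit. The my_region struct and the pdev variable are assumed to come from the question's driver; dma_set_coherent_mask(), DMA_BIT_MASK() and dma_alloc_coherent() are the standard kernel API.

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* Sketch: allocate coherent DMA memory against the real PCI device
 * instead of NULL, after declaring that the device can only address
 * 32 bits, so the buffer is placed below 4 GB. */
static int my_alloc_region(struct pci_dev *pdev, struct my_region *new_region,
                           size_t size)
{
        int ret;

        /* Tell the DMA API this device can only address 32 bits. */
        ret = dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(32));
        if (ret)
                return ret;

        new_region->kaddr = dma_alloc_coherent(&pdev->dev, size,
                                               &new_region->paddr, GFP_KERNEL);
        if (!new_region->kaddr)
                return -ENOMEM;

        return 0;
}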

Related

Why 4-level paging can only cover 64 TiB of physical address space

These words are from linux/Documentation/x86/x86_64/5level-paging.rst:
Original x86-64 was limited by 4-level paging to 256 TiB of virtual address space and 64 TiB of physical address space.
I know that the virtual address limit is 256 TiB because 2^48 = 256 TiB, but I don't know why the physical limit is only 64 TiB.
Suppose we set the size of each page to 4 KiB. A linear address then has 12 bits of offset and 9 bits of index at each of the four levels, which means 512 entries per level. A linear address can therefore cover 512^4 pages, and 512^4 * 4 KiB = 256 TiB of space.
This is my understanding of the calculation of the space limit. I'm wondering what's wrong with it.
The x86-64 ISA's physical address space limit is unchanged by PML5, remaining at 52-bit. Real CPUs implement some narrower number of physical address bits, saving bits in cache tags and TLB entries, among other places.
The 64 TiB limit is not imposed by x86-64 itself, but by the way Linux requires more virtual address space than physical for its own convenience and efficiency. See x86_64/mm.txt for the actual layout of Linux's virtual address space on x86-64 with PML4 paging, and note the 64 TiB "direct mapping of all physical memory (page_offset_base)"
x86-64 Linux doesn't do HIGHMEM / LOWMEM
Linux can't actually use more than phys mem = 1/4 virtual address space, without nasty HIGHMEM / LOWMEM stuff like in the bad old days of 32-bit kernels on machines with more than 1 GiB of RAM (vm/highmem.html). (With a 3:1 user:kernel split of address space, letting user-space have 3GiB, but with the kernel having to map pages in/out of its own space if not accessing them via the current process's user-space addresses.)
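To spell out the arithmetic behind that 1/4 figure (just restating the numbers above):
    virtual space with PML4:        2^48 bytes = 256 TiB
    direct map = 1/4 of that:       2^48 / 4 = 2^46 bytes = 64 TiB
    PTE physical-address capacity:  2^52 bytes = 4 PiB (bits 12..51 of the entry)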
Linus's rant about 32-bit PAE expands on why it's nice for an OS to have enough virtual address space to keep everything mapped, with the usual assertion that people who don't agree with him are morons. :P I tend to agree with him on this, that there are obvious efficiency advantages and that PAE is a huge PITA for the kernel. Probably even more so on an SMP system.
If anyone had proposed a patch to add highmem support for x86-64 to allow using more than 64 TiB of physical memory with the existing PML4 format, I'd expect Linus would tell them 1995 called and wants its bad idea back. He wouldn't consider merging such a patch unless that much RAM became common for servers while hardware vendors still hadn't provided an extension for wider virtual addresses.
Fortunately that didn't happen: probably no CPU has supported wider than 46-bit phys addrs without supporting PML5. Vendors know that supporting more RAM than mainline Linux can use wouldn't be a selling point. But as the doc said, commercial systems were getting up to a max capacity of 64 TiB.
x86-64's page-table format has room for 52-bit physical addresses
The x86-64 page-table format itself has always had that much room: Why in x86-64 the virtual address are 4 bits shorter than physical (48 bits vs. 52 long)? has diagrams from AMD's manuals. Of course early CPUs had narrower physical addresses so you couldn't for example have a PCIe device put its device memory way up high in physical address space.
Your calculation has nothing to do with physical address limits; those are set by the number of bits in each page-table entry that can be used for a physical address.
In x86-64 (and PAE), the page table format reserves bits up to bit #51 for use as physical-address bits, so OSes must zero them for forward compatibility with future CPUs. The low 12 bits are used for other things, but the physical address is formed by zeroing out the bits other than the phys-address bits in the PTE, so those low 12 bits become the low zero bits in an aligned physical-page address.
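As a small illustration of that masking (the mask constant follows the x86-64 PTE layout just described, bits 12..51; this is a standalone sketch, not a kernel header definition):

#include <stdint.h>

/* Bits 12..51 of an x86-64 PTE hold the physical page-frame address;
 * the low 12 bits and the highest bits (NX, software bits) are flags. */
#define PTE_PHYS_MASK 0x000ffffffffff000ULL

static inline uint64_t pte_to_phys(uint64_t pte, uint64_t vaddr)
{
    /* Physical page base from the PTE, plus the page-offset bits of
     * the virtual address. */
    return (pte & PTE_PHYS_MASK) | (vaddr & 0xfffULL);
}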
x86 terminology note: logical addresses are seg:off, and segment_base + offset gives you a linear address. With paging enabled (as required in long mode), linear addresses are virtual, and are what's used as a search key for the page tables (effectively a radix tree cached by the TLB).
Your calculation is just correctly reiterating the 256 TiB size of virtual address space, based on 4-level page tables with 4k pages. That's how much memory can be simultaneously mapped with PML4.
A physical page has to be the same size as a virtual page, and in x86-64 yes that's 4 KiB. (Or 2M largepage or 1G hugepage).
Fun fact: the x86-64 page-table-entry format is the same as PAE, so modern CPUs can also access large amounts of memory in 32-bit mode, though of course not map it all at once. It's probably not a coincidence that AMD chose to reuse an existing well-designed format when creating AMD64, so their CPUs would only need two different modes for the hardware page-table walker: legacy x86 with 4-byte PTEs (10 bits per level) and PAE/AMD64 with 8-byte PTEs (9 bits per level).

How is the 34 bit physical address space accessed in a RISC-V 32 bit system when virtual memory is disabled?

In the RISC-V 32 bit ISA, the physical address space is 34 bit with a 32 bit virtual address space. When virtual memory is enabled in supervisor mode the 32 bit virtual address is translated by accessing the page table, yielding a 34 bit physical address. When virtual memory is disabled however, the 32 bit addresses still must be converted to a 34 bit physical address. In the RISC-V privileged ISA specification in section 4.1.12 it states:
When MODE=Bare, supervisor virtual addresses are equal to supervisor physical addresses
So, my question is: does this mean that only the low 4GB (bottom 32 bits) of memory are able to be accessed in supervisor mode with virtual memory disabled? If so, then how is the rest of the 16 GB (34 bit) physical memory supposed to be accessed in supervisor mode when virtual memory is disabled?
SV32 Virtual and Physical Addressing
Someone asked a similar question in an issue on the Github repo for the ISA manual. It appears to be the case that when running with MODE=Bare with RV32, you can only access the bottom 4GiB of the 34-bit physical address space, and the top 12GiB are inaccessible. The 32-bit register values are zero-extended into 34-bit physical addresses.
While this isn't explicitly stated in the manual, it does say in the caption for Figure 4.17 in the Privileged ISA spec that “when mapping between narrower and wider addresses, RISC-V usually zero-extends a narrower address to a wider size.”
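As a toy illustration of that zero-extension (plain arithmetic, not anything copied from the spec):

#include <stdint.h>

/* In Bare mode an RV32 effective address is used directly as the
 * physical address, zero-extended from 32 to 34 bits, so anything at
 * or above 2^32 (the top 12 GiB of the 16 GiB space) is unreachable
 * without Sv32 translation. */
static inline uint64_t bare_mode_phys(uint32_t effective_addr)
{
    return (uint64_t)effective_addr;   /* bits 32..33 are always zero */
}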

Is there still HIGHMEM allocation in x86_64?

On x86 with its 32-bit virtual address space, where the lower part of physical memory is mapped contiguously after the kernel at 0xc0000000, the upper part of physical memory had to be mapped into the virtual address space dynamically.
Has this changed in the x86_64 kernel?
Is there still HIGHMEM allocation, or is all physical memory in x86_64 accessible with a simple physical-to-virtual address translation macro?
No. Highmem existed because of the split into ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM. On x86_64 the virtual address space is so large that the kernel splits its half into several regions with large unused holes between them for safety, and there is no such thing as high memory anymore. You can read the x86_64 memory-map document for more detail about the structure of the x86_64 kernel address space.
I found this one:
https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt
ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
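A hedged sketch of what that "simple translation macro" looks like in practice (the base constant is the default 4-level direct-map start, 0xffff888000000000, in recent kernels; it was 0xffff880000000000 in older ones, it is randomized by KASLR, and real code should use the kernel's __va()/__pa() helpers rather than open-coding this):

/* On x86_64 every physical page is mapped at a fixed offset inside the
 * direct map, so physical <-> virtual translation is pure addition. */
#define DIRECT_MAP_BASE 0xffff888000000000UL   /* assumed: default PML4 base, no KASLR */

static inline void *phys_to_virt_sketch(unsigned long phys)
{
        return (void *)(phys + DIRECT_MAP_BASE);
}

static inline unsigned long virt_to_phys_sketch(const void *virt)
{
        return (unsigned long)virt - DIRECT_MAP_BASE;
}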

How to read 32-bit PCI bar memory in 64-bit linux kernel environment

I want to do I/O on my PCIe device. I am running Ubuntu 16.04 LTS with Linux kernel 4.4.0.
The output of the lspci -v command is:
06:00.0 Unclassified device [00ff]: Device 1aa1:2000 (rev 01)
Subsystem: Device 1aa1:2000
Physical Slot: 1-4
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f1008000 (32-bit, non-prefetchable) [size=8K]
Memory at ee000000 (32-bit, non-prefetchable) [size=32M]
Memory at f100a000 (32-bit, non-prefetchable) [size=4K]
Memory at f0000000 (32-bit, non-prefetchable) [size=16M]
Memory at f1000000 (32-bit, non-prefetchable) [size=32K]
Capabilities: <access denied>
Kernel driver in use: my_pci
Kernel modules: my_pci
Clearly, PCI addresses are 32-bit.
I want to know how to use the ioread32/iowrite32 functions to read from/write to the BAR addresses.
A pointer of type unsigned char __iomem *mem would be 64-bit on my machine, and if I use, say:
ioread32(mem + some_offset);
The expression mem + some_offset would be 64-bit and would result in a crash.
How would I do the I/O?
The PCI device you're working with uses 32-bit addressing mode.
When your PC enumerates the BARs and writes the physical address into a BAR, it writes a masked value: only the lower 32 bits of the 64-bit host address.
Print the physical address the OS/BIOS has assigned to this BAR from the driver and compare it.
Besides, this is a physical address, so you can't iowrite to it directly anyway.
So I don't really understand your goal.
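For what it's worth, the usual pattern for the question's goal is to map the BAR and then use ioread32()/iowrite32() on the returned cookie; the raw physical address shown by lspci is never dereferenced. A hedged sketch (BAR index 0 and the 0x10 register offset are made-up placeholders; pci_iomap(), pci_resource_len(), ioread32() and pci_iounmap() are the standard kernel API, and pdev is assumed to be the struct pci_dev * from the driver's probe routine):

#include <linux/pci.h>
#include <linux/io.h>

static int my_read_reg(struct pci_dev *pdev, u32 *val)
{
        void __iomem *mem;

        /* Map BAR0 into kernel virtual address space. */
        mem = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
        if (!mem)
                return -ENOMEM;

        *val = ioread32(mem + 0x10);    /* hypothetical register offset */

        pci_iounmap(pdev, mem);
        return 0;
}

The 64-bit width of the __iomem pointer is not a problem: mem + some_offset remains a valid cookie as long as the offset stays within the mapped BAR.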

addressability vs address space vs address bus

How do you determine addressability based on address space? How do you determine the size of the address bus based on the addressability? For example: the addressability of a machine is 32 bits; what is the size of the address bus?
The address bus connects the CPU with the main memory. So if the address bus has 32 bits, the max size of the main memory is 2^32 bytes, i.e. 4 GB.
The address bus transfers a physical address, and thus the physical address space in this example is 4 GB.
However the CPU generates virtual addresses, and the virtual addresses are the virtual address space. The virtual addresses have to be mapped to physical addresses by a memory management unit.
In principle, one can map a small virtual address space to a large physical one (as done earlier e.g. in the PDP11 computers), but nowadays mostly a larger virtual address space is mapped to a smaller physical one, e.g. from a 64-bit CPU with a 2^64 byte virtual address space to a physical memory with a 32-bit address bus, which is thus 4 GB large.
So if you have a primitive system without memory management, and you want all addresses that the CPU can generate to be existing main memory addresses, then your address bus must have the same number of bits as the CPU uses for addressing, e.g. 32 bits.
But in a real system the virtual CPU addresses are essentially independent from the physical memory addresses.
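A worked instance of the example in the question (assuming byte addressing and no MMU, so every generated address must be a real memory address): addressability = 32 bits, so the address space is 2^32 bytes = 4 GiB, and the address bus needs 32 lines to reach all of it.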
