I am reading the memory-barriers.txt file quoted below. Please clarify my doubt:
1) For example, if CPU 1 got the lock, how can the PCI bridge see STORE
*ADDR = 4 before STORE *DATA = 1?
ACQUIRES VS I/O ACCESSES
Under certain circumstances (especially involving NUMA), I/O accesses within
two spinlocked sections on two different CPUs may be seen as interleaved by the
PCI bridge, because the PCI bridge does not necessarily participate in the
cache-coherence protocol, and is therefore incapable of issuing the required
read memory barriers.
For example:
CPU 1
===============================
spin_lock(Q)
writel(0, ADDR)
writel(1, DATA);
spin_unlock(Q);
CPU 2
===============================
spin_lock(Q);
writel(4, ADDR);
writel(5, DATA);
spin_unlock(Q);
may be seen by the PCI bridge as follows:
STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
which would probably cause the hardware to malfunction.
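For context, the fix shown a little later in that document is to insert mmiowb() before dropping the lock, which forces CPU 1's MMIO stores out to the bridge before CPU 2's stores can follow (reproduced approximately from memory; check your copy of memory-barriers.txt for the exact wording):
CPU 1
===============================
spin_lock(Q)
writel(0, ADDR)
writel(1, DATA);
mmiowb();
spin_unlock(Q);
CPU 2
===============================
spin_lock(Q);
writel(4, ADDR);
writel(5, DATA);
mmiowb();
spin_unlock(Q);
With the mmiowb() calls in place, the bridge sees STORE *ADDR = 0, STORE *DATA = 1 before STORE *ADDR = 4, STORE *DATA = 5.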
In the 3.19 kernel, writel() internally contains a hardware barrier, mmiowb() expands to nothing on ARM and to a compiler barrier on x86, and spin_unlock() internally provides a compiler barrier. So the documentation at kernel.org/doc/Documentation/memory-barriers.txt needs to be corrected.
__shared__ float smem[2];
smem[0] = global_memory[0];
smem[1] = global_memory[1];
/*process smem[0]...*/
/*process smem[1]...*/
My question is, does smem[1] = global_memory[1]; block computation on smem[0]?
In "Cuda thread scheduling - latency hiding" and "Cuda global memory load and store" they say a memory read will not stall the thread until the read data is actually used. Does storing it to shared memory count as "using the data"? Should I do something like this:
__shared__ float smem[2];
float a = global_memory[0];
float b = global_memory[1];
smem[0] = a;
/* process smem[0]*/
smem[1] = b;
/* process smem[1]*/
Or perhaps the compiler does it for me? But then does it use extra registers?
Yes, in the general case this would block the CUDA thread:
smem[0] = global_memory[0];
The reason is that this operation would be broken into two steps:
LDG Rx, [Ry]
STS [Rz], Rx
The first SASS instruction loads from global memory. This operation does not block the CUDA thread. It can be issued to the LD/ST unit, and the thread can continue. However the register target of that operation (Rx) is tracked, and if any instruction needs to use the value from Rx, the CUDA thread will stall at that point.
Of course the very next instruction is the STS (store shared) instruction that will use the value from Rx, so the CUDA thread will stall at that point (until the global load is satisfied).
Of course it's possible that the compiler may reorder the instructions so that the STS instruction occurs later, but there is no guarantee of that. Regardless of where the STS instruction is placed by the compiler, the CUDA thread will stall at that point, until the global load is completed. For the example you have given, I think it's quite likely that the compiler would create code that looks like this:
LDG Rx, [Ry]
LDG Rw, [Ry+1]
STS [Rz], Rx
STS [Rz+1], Rw
In other words, I think it's likely that the compiler would organize these loads such that both global loads can be issued before a possible stall occurs. However, there is no guarantee of this, and the specific behavior for your code can only be deduced by studying the actual SASS, but in the general case we should assume the possibility of a thread stall.
Yes, if you can break up the loads and stores as you have shown in your code, then this operation:
float b = global_memory[1];
should not block this operation:
smem[0] = a;
/* process smem[0]*/
Having said all that, CUDA introduced a new mechanism to address this scenario in CUDA 11, supported by devices of compute capability 8.0 and higher (so, all Ampere GPUs at this time). This new feature is referred to as asynchronous copy of data from global to shared memory. It allows for these copy operations to proceed without stalling CUDA threads. However this feature requires proper use of a barrier to make sure that when you need to actually use the data in shared memory, it is present.
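As a rough illustration (not part of the original answer), a cooperative-groups sketch of that asynchronous copy could look like the following; the kernel name and the two-element copy simply mirror the earlier example:
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void process(const float *global_memory)
{
    __shared__ float smem[2];
    cg::thread_block block = cg::this_thread_block();

    // Issue the global->shared copy without stalling the issuing threads.
    cg::memcpy_async(block, smem, global_memory, 2 * sizeof(float));

    // ... independent work can overlap with the copy here ...

    cg::wait(block);  // barrier: smem contents are guaranteed valid after this
    /* process smem[0] and smem[1] */
}
On compute capability 8.0+ the copy uses the dedicated asynchronous hardware path; on older devices the same code compiles but falls back to a synchronous copy.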
I was wondering if there is an existing system call/API for getting the physical address of a virtual address?
If there is none, could I get some direction on how to get that working?
Also, how do I get the physical address of MMIO, which is non-pageable physical memory?
The answer lies in IOMemoryDescriptor and IODMACommand objects.
If the memory in question is kernel-allocated, it should be allocated by creating an IOBufferMemoryDescriptor in the first place. If that's not possible, or if it's a buffer allocated in user space, you can wrap the relevant pointer using IOMemoryDescriptor::withAddressRange(address, length, options, task) or one of the other factory functions. In the case of withAddressRange, the address passed in must be virtual, in the address space of task.
You can directly grab physical address ranges from an IOMemoryDescriptor by calling the getPhysicalSegment() function (only valid between prepare()…complete() calls). However, normally you would do this for creating scatter-gather lists (DMA), and for this purpose Apple strongly recommends the IODMACommand. You can create these using IODMACommand::withSpecification(). Then use the genIOVMSegments() function to generate the scatter-gather list.
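For illustration, a minimal (untested) sketch of that IODMACommand flow, assuming a device that can address 64 bits and an already-prepared IOMemoryDescriptor* named md; the specification values are illustrative:
// Describe what the device's DMA engine can accept.
IODMACommand *dma = IODMACommand::withSpecification(
    kIODMACommandOutputHost64,   // segment output format
    64,                          // number of address bits the device supports
    0,                           // 0 = no maximum segment size
    IODMACommand::kMapped,       // use the system mapper (IOMMU) if present
    0,                           // 0 = no maximum transfer size
    1);                          // byte alignment
// TODO: check for dma == nullptr

dma->setMemoryDescriptor(md);    // by default this also prepares the descriptor

UInt64 offset = 0;
while (offset < md->getLength())
{
    IODMACommand::Segment64 segments[16];
    UInt32 num_segments = 16;
    IOReturn result = dma->genIOVMSegments(&offset, segments, &num_segments);
    // TODO: error handling; each segments[i].fIOVMAddr / segments[i].fLength
    // pair is one scatter-gather entry to program into the device.
}

dma->clearMemoryDescriptor();
dma->release();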
Modern Macs, and also some old PPC G5s, contain an IOMMU (Intel calls this VT-d), so the system memory addresses you pass to PCI/Thunderbolt devices are not in fact physical, but IO-mapped. IODMACommand will do this for you, as long as you use the "system mapper" (the default) and set mappingOptions to kMapped. If you're preparing addresses for the CPU, not a device, you will want to turn off mapping - use kIOMemoryMapperNone in your IOMemoryDescriptor options. Depending on what exactly you're trying to do, you probably don't need IODMACommand in this case either.
Note: it's often wise to pool and reuse your IODMACommand objects, rather than freeing and reallocating them.
Regarding MMIO, I assume you mean PCI BARs and similar - for IOPCIDevice, you can grab an IOMemoryDescriptor representing the memory-mapped device range using getDeviceMemoryWithRegister() and similar functions.
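For example, a hedged sketch (again not from the original answer) of grabbing BAR0, assuming an already-attached IOPCIDevice* named pciDevice:
#include <IOKit/pci/IOPCIDevice.h>

IODeviceMemory *bar0 =
    pciDevice->getDeviceMemoryWithRegister(kIOPCIConfigBaseAddress0);
if (bar0 != NULL)
{
    // Physical base of the BAR; MMIO space is not pageable, and this
    // answers the "physical address of MMIO" part of the question.
    IOPhysicalAddress bar_phys = bar0->getPhysicalAddress();

    // Map the BAR into the kernel task if CPU access is also needed.
    IOMemoryMap *map = bar0->map();
    if (map != NULL)
    {
        volatile uint32_t *regs =
            (volatile uint32_t *)map->getVirtualAddress();
        // ... MMIO reads/writes through regs ...
        map->release();
    }
}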
Example:
If all you want are pure CPU-space physical addresses for a given virtual memory range in some task, you can do something like this (untested as a complete kext that uses it would be rather large):
// INPUTS:
mach_vm_address_t virtual_range_start = …; // start address of virtual memory
mach_vm_size_t virtual_range_size_bytes = …; // number of bytes in range
task_t task = …; // Task object of process in which the virtual memory address is mapped
IOOptionBits direction = kIODirectionInOut; // whether the memory will be written or read, or both during the operation
IOOptionBits options =
kIOMemoryMapperNone // we want raw physical addresses, not IO-mapped
| direction;
// Process for getting physical addresses:
IOMemoryDescriptor* md = IOMemoryDescriptor::withAddressRange(
    virtual_range_start, virtual_range_size_bytes, options, task);
// TODO: check for md == nullptr
// Wire down virtual range to specific physical pages
IOReturn result = md->prepare(direction);
// TODO: do error handling
IOByteCount offset = 0;
while (offset < virtual_range_size_bytes)
{
    IOByteCount segment_len = 0;
    addr64_t phys_addr = md->getPhysicalSegment(offset, &segment_len, kIOMemoryMapperNone);
    // TODO: do something with physical range of segment_len bytes at address phys_addr here
    offset += segment_len;
}
/* Unwire. Call this only once you're done with the physical ranges
* as the pager can change the physical-virtual mapping outside of
* prepare…complete blocks. */
md->complete(direction);
md->release();
As explained above, this is not suitable for generating DMA scatter-gather lists for device I/O. Note also this code is only valid for 64-bit kernels. You'll need to be careful if you still need to support ancient 32-bit kernels (OS X 10.7 and earlier) because virtual and physical addresses can still be 64-bit (64-bit user processes and PAE, respectively), but not all memory descriptor functions are set up for that. There are 64-bit-safe variants available to be used for 32-bit kexts.
Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?
You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(p)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);
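For reference, the gfp variant mentioned above takes the allocation mask as a third argument; a one-line sketch (the GFP_KERNEL choice is just illustrative):
struct page *p = shmem_read_mapping_page_gfp(mapping, index, GFP_KERNEL);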
The answer to your question is a simple no, or a yes with complex modifications to the kernel source.
First, to enable swapping out, you have to ask yourself what happens when kswapd swaps pages out. Essentially it walks through all the processes and decides whether their memory can be swapped out or not. All of that memory is in the hardware mode of ring 3, so SMAP essentially forbids it from being read as data or executed as a program in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
Check your distro's CONFIG_X86_SMAP; on my Ubuntu it defaults to "y", as it has for the past few years.
But if you keep your memory at a kernel address (ring 0), then you would need to change the kswapd operation to trigger swap-out of kernel addresses. Which kernel addresses should be walked first? And what if the address is part of kswapd's own kernel operation? The complexities involved are huge.
Next, consider the swap-in operation: when a memory read is attempted and its "not present" bit is set, a hardware exception triggers the Linux kernel memory fault handler (which is __do_page_fault()).
And looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and thereafter how it handles kernel addresses (do_kern_addr_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
which essentially just reports an error for every possible scenario. If you want to enable page faulting on kernel addresses, then this path has to be modified.
Note too that the SMAP check (inside smap_violation()) is done in the user-address page fault path (do_user_addr_fault()).
I was reading section 'Part Id' of the following document (I'm not sure how relevant this document is to kernel 2.6.35, for instance); specifically it says:
..the DMA address of the memory must be within the dma_mask of the device..
and they recommend passing certain flags, such as GFP_DMA, to kmalloc() so that the memory is guaranteed to fall within the DMA mask provided.
However, if the memory is allocated from a cache pool created by kmem_cache_create(), and with kmem_cache_alloc(.., GFP_ATOMIC), does this fail to meet the requirements outlined in DMA-API.txt?
On the other hand, LDD talks about the __GFP_DMA flag with regard to legacy ISA devices, so I'm not sure it is applicable to PCI/PCIe devices.
This is an x86 64-bit platform, if it matters:
pci_set_dma_mask(dev, 0xffffffffffffffffULL);
pci_set_consistent_dma_mask(dev, 0xffffffffffffffffULL);
I would appreciate some explanation of this.
Regarding GFP_* flags for DMA, on x86:
ISA - when using kmalloc() you need to bitwise-OR GFP_DMA with GFP_KERNEL (or GFP_ATOMIC), because of the following.
GFP_DMA guarantees:
(1) physical addresses are consecutive when get_free_page returns more than one page, and
(2) only addresses lower than MAX_DMA_ADDRESS are returned. MAX_DMA_ADDRESS is 16MB on the PC because of ISA constraints.
PCI - you don't need to use GFP_DMA because there is no MAX_DMA_ADDRESS limit.
The dma_mask is checked by the device when calling dma_map_* or dma_alloc_coherent.
dma_alloc_coherent() ensures the allocated memory can be used with dma_map_*, and gives other benefits too (the implementation may choose to ignore flags that affect the location of the returned memory, like GFP_DMA).
You can refer to http://coweb.cc.gatech.edu/sysHackfest/uploads/58/DMA_howto.1.txt
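For illustration, a minimal sketch (not from the quoted document) of how this typically looks for a 64-bit-capable PCIe device on x86-64; pdev, BUF_SIZE and the transfer direction are illustrative, and error unwinding is omitted:
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/errno.h>

#define BUF_SIZE 4096 /* illustrative */

static int setup_dma(struct pci_dev *pdev)
{
    dma_addr_t dma_handle, bus_addr;
    void *cpu_addr, *buf;

    /* Tell the DMA layer what the device can address. */
    if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) ||
        pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)))
        return -EIO;

    /* Coherent allocation: guaranteed to satisfy the consistent mask,
     * so no GFP_DMA is needed for a PCI/PCIe device. */
    cpu_addr = dma_alloc_coherent(&pdev->dev, BUF_SIZE, &dma_handle, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;

    /* Streaming mapping of an ordinary kmalloc()/kmem_cache_alloc() buffer:
     * the dma_mask is checked at map time, and an IOMMU or bounce buffer
     * can be used if the pages lie outside the device's reach. */
    buf = kmalloc(BUF_SIZE, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;
    bus_addr = dma_map_single(&pdev->dev, buf, BUF_SIZE, DMA_TO_DEVICE);
    if (dma_mapping_error(&pdev->dev, bus_addr))
        return -EIO;

    /* ... program dma_handle / bus_addr into the device ... */
    return 0;
}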
I'm trying to use monitor/mwait instructions to monitor DMA writes from a device to a memory location. In a kernel module (char device) I have the following code (very similar to this piece of kernel code) that runs in a kernel thread:
static int do_monitor(void *arg)
{
    struct page *p = arg; // p is a 'struct page *'; it's also remapped to user space
    uint32_t *location_p = phys_to_virt(page_to_phys(p));
    uint32_t prev = 0;
    int i = 0;
    while (i++ < 20) // to avoid infinite loop
    {
        if (*location_p == prev)
        {
            __monitor(location_p, 0, 0);
            if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
                clflush(location_p);
            if (*location_p == prev)
                __mwait(0, 0);
        }
        prev = *location_p;
        printk(KERN_NOTICE "%d", prev);
    }
    return 0;
}
In user space I have the following test code:
int fd = open("/dev/mon_test_dev", O_RDWR);
unsigned char *mapped = (unsigned char *)mmap(0, mmap_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
for (int i = 1; i <= 5; ++i)
    *mapped = i;
munmap(mapped, mmap_size);
close(fd);
And the kernel log looks like this:
1
2
3
4
5
5
5
5
5
5
5 5 5 5 5 5 5 5 5 5
I.e. it seems that mwait doesn't wait at all.
What could be the reason?
The definition of MONITOR/MWAIT semantics does not explicitly specify whether DMA transactions may or may not trigger it. Triggering is supposed to happen for stores performed by logical processors.
The current descriptions of MONITOR and MWAIT in Intel's official Software Developer's Manual are quite vague in that respect. However, there are two clauses in the MONITOR section that caught my attention:
The content of EAX is an effective address (in 64-bit mode, RAX is used). By default, the DS segment is used to create a linear address that is monitored.
The address range must use memory of the write-back type. Only write-back memory will correctly trigger the monitoring hardware.
The first clause states that MONITOR is meant to be used with linear addresses, not physical ones. Devices and their DMA are meant to work with physical addresses only. So basically this means that all agents relying on the same MONITOR range should operate in the same domain of virtual memory space.
The second clause requires the monitored memory region to be cacheable (write-back, WB). For DMA, the respective memory range usually has to be marked as uncacheable, or write-combining at best (UC or WC). This is an even stronger indicator that your intent to use MONITOR/MWAIT to be triggered by DMA is very unlikely to work on current hardware.
Considering your high-level goal - to be able to tell when a device has written to a given memory range - I cannot recall any robust method to achieve it, besides using virtualization for devices (VT-d, IOMMU, etc.). Basically, the classic approach for a peripheral device is to issue an interrupt when it is done writing to memory. Until an interrupt arrives, there is no way for the CPU to tell whether all DMA bytes have successfully reached their destination in memory.
Device virtualization allows physical addresses to be abstracted away from a device in a transparent manner, and provides an equivalent of a page fault when the device attempts to read from or write to memory.
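For reference, a minimal sketch (not from the original answer) of the classic interrupt-driven completion pattern mentioned above: the device writes its data by DMA and then raises an interrupt, and the waiting thread simply sleeps until that happens. The irq number, device name and helper function are illustrative:
#include <linux/interrupt.h>
#include <linux/completion.h>

static DECLARE_COMPLETION(dma_done);

static irqreturn_t dma_done_irq(int irq, void *dev_id)
{
    /* Runs only after the device's DMA write has reached memory. */
    complete(&dma_done);
    return IRQ_HANDLED;
}

static int wait_for_device_write(unsigned int irq, void *dev_id)
{
    int ret = request_irq(irq, dma_done_irq, 0, "mon_test_dev", dev_id);
    if (ret)
        return ret;
    wait_for_completion(&dma_done); /* sleep until the IRQ fires */
    free_irq(irq, dev_id);
    return 0;
}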