Issue Getting Physical Address in Kernel Module - Can't MMAP to that address - memory-management

I am implementing DMA through memory mapped registers and need the physical address of the buffers used to pass on to the DMA device.
I allocate my memory using a kernel module, then pass the physical address on to the userspace application driving the DMA. I have tried with and without the Page Shift.
float* buffer = (float*) kcalloc(512, sizeof(float), GFP_DMA); //allocate physical memory for DMA
unsigned long physAddr = virt_to_phys(buffer) >> PAGE_SHIFT;
When I write these addrs to my DMA device no data moves - and when I attempt to mmap to these addrs from /dev/mem they result in invalid pointers (segfaults on read).
How should I be getting these addresses for the allocated memory?
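One detail in the snippet above is worth isolating: shifting the result of virt_to_phys() right by PAGE_SHIFT yields a page frame number rather than a byte address, while the offset argument of mmap() on /dev/mem is a byte offset that must be page-aligned. The user-space sketch below only illustrates that arithmetic. The helper names and the 4 KiB page size (a PAGE_SHIFT of 12) are assumptions for illustration, not kernel APIs, and the sketch does not by itself explain why the DMA device moves no data.

```c
#include <stdint.h>

/* Assumed page size of 4 KiB; the real value comes from the kernel's PAGE_SHIFT. */
#define EX_PAGE_SHIFT 12u
#define EX_PAGE_SIZE  ((uint64_t)1 << EX_PAGE_SHIFT)

/* Page frame number: what `phys >> PAGE_SHIFT` yields. Not a byte address. */
static uint64_t ex_pfn(uint64_t phys) { return phys >> EX_PAGE_SHIFT; }

/* Page-aligned byte address, usable as an mmap() offset. */
static uint64_t ex_page_base(uint64_t phys) { return phys & ~(EX_PAGE_SIZE - 1); }

/* Offset of the buffer inside its page; add it to the pointer mmap() returns. */
static uint64_t ex_page_offset(uint64_t phys) { return phys & (EX_PAGE_SIZE - 1); }
```

For example, a buffer at physical address 0x1F345678 has page frame number 0x1F345, page base 0x1F345000 and in-page offset 0x678.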

Related

Linux Kernel Driver - physical CPU memory not updated. DMA problem

I'm using an Orange Pi 3 LTS with an Allwinner H6 ARM CPU, and I'm writing a UART driver that uses DMA for Rx and Tx. I allocated RAM with kmalloc(), so I know both the physical address of the buffer and the corresponding logical address in the kernel driver's address space.
The problem is that physical memory does not seem to be updated after I write through the logical address. My driver has an init() callback, which runs when the driver is attached to the kernel, and an exit() callback, which runs when it is detached. In init() I allocate the memory with kmalloc() and fill it with data through the logical address (because the kernel cannot access physical memory directly). Still in init(), after filling the memory, I trigger one of the DMA channels by writing to the CPU registers. The DMA controller should then fetch the descriptor (by pointer) from physical RAM and transmit the data over the UART. But physical memory does not appear to be updated within init(): only the logical view is updated, because the CPU registers hold wrong data.
If I only fill the descriptor in init() and trigger the DMA from another callback, for example exit(), it works: physical RAM holds the correct data and the data is sent over the UART as expected.
I don't understand this. Why is physical memory not updated (through the MMU) immediately after a write to the logical address within a single callback such as init(), but only after that callback returns?
Following up on the problem described above: I studied the Linux DMA API documentation and finally found a solution.
As noted in a comment, the problem was cache coherency.
Instead of kmalloc(), memory for DMA should be allocated with dma_alloc_coherent(), which returns the kernel logical address of a non-cached buffer and stores the corresponding physical address through a pointer argument.
Here is my test code, which works for me: physical memory is now updated immediately when I write through the logical address in kernel space. It allocates 1024 bytes of RAM.
static struct device *dev;            /* must point at the driver's struct device before ptr_init() runs */
static dma_addr_t physical_address;   /* address handed to the DMA controller */
static unsigned int *logical_address; /* kernel address of the same buffer */
static u64 dma_mask = DMA_BIT_MASK(32); /* static, because dev->dma_mask keeps pointing at it */
static void ptr_init(void)
{
    dev->dma_mask = &dma_mask;
    if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)) != 0)
        pr_err("Mask not OK\n");
    else
        pr_info("Mask OK\n");
    logical_address = dma_alloc_coherent(dev, 1024, &physical_address, GFP_KERNEL);
    if (logical_address == NULL) {
        pr_err("allocation NOT OK\n");
        return;
    }
    pr_info("allocation OK\n");
    pr_info("logical address: %px\n", logical_address);    /* %px prints the raw pointer value */
    pr_info("physical address: %pad\n", &physical_address); /* %pad takes a pointer to a dma_addr_t */
}

Limit Linux DMA allocation to specific range

I am working on an SoC whose RAM starts at 0x200M.
For some reason I need to limit the DMA allocations to below 0x220M, so I called dma_set_mask_and_coherent(dev, 0x21FFFFFFF).
I noticed that
dev->dma_mask was not set, while
dev->coherent_dma_mask was set.
Still, calling dma_alloc_coherent returns buffers above the requested limit.
How can I limit the buffer address?
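For context, DMA masks are conventionally built with the kernel's DMA_BIT_MASK(n) macro, which sets the n low-order bits. The user-space sketch below reproduces that macro's arithmetic under a different name and checks whether a value has that contiguous form. The helper names are hypothetical. It shows that 0x21FFFFFFF is not of that form, which may be relevant to the behavior described, though it is not offered as a confirmed diagnosis.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same arithmetic as the kernel's DMA_BIT_MASK(n): the n low-order bits set. */
#define EX_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* True when mask is 2^n - 1 for some n >= 1, i.e. a contiguous run of low bits. */
static bool ex_is_contiguous_low_mask(uint64_t mask)
{
    return mask != 0 && (mask & (mask + 1)) == 0;
}

/* Smallest n such that EX_DMA_BIT_MASK(n) covers every bit set in limit. */
static unsigned ex_bits_needed(uint64_t limit)
{
    unsigned n = 0;
    while (limit != 0) {
        limit >>= 1;
        n++;
    }
    return n;
}
```

With these helpers, 0x21FFFFFFF is reported as non-contiguous (bits 29 to 32 are clear), and the smallest contiguous mask that covers it is the 34-bit mask 0x3FFFFFFFF.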

/dev/mem or user space burst transfer; how to get faster /dev/mem access

setup
I have a bunch of RAM on the PL (programmable logic / FPGA) side of a Zynq-7000 chip. This memory can be accessed from both the PL and the PS (processing system / CPU) side. The plan is for the CPU to load a large GiB buffer and hand it off to the PL.
Linux bursts to / from the RAM when device tree is modified
When I modify the device tree so linux can see the ram I observe fast read/write speeds; the hardware/firmware is capable of burst read/write.
memory {
device_type = "memory";
// The 512 MiB memory at 0x60000000
reg = <0x0 0x40000000 0x60000000 0x20000000>;
};
mmap device tree memory
The device tree is modified to prevent linux from using the RAM (so it can be used as a buffer for the PL instead)
memory {
device_type = "memory";
reg = <0x0 0x40000000>;
};
mmap is slow even after playing around with flags
I have tried several ways of setting up mmap()
int* addr_start = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, address);
int* addr_start = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_POPULATE, fd, address);
While reliable, none of them give fast results when running an iterate - write / read test
// words_per_page is on the order of 2**20/4
case TEST_WRITE:
for( int ii=0; ii < words_per_page; ii++)
*waddr++=count++;
break;
case TEST_READ:
for( int ii=0; ii < words_per_page; ii++)
sum += *raddr++;
break;
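Written out as standalone functions, the two test loops might look like the sketch below. The function names are hypothetical, and the functions take any pointer, so they can be exercised on an ordinary array as well as on the pointer returned by mmap(). The volatile qualifier is an assumption added so the compiler does not optimize the memory accesses away.

```c
#include <stddef.h>
#include <stdint.h>

/* Write an incrementing counter to every word, as in the TEST_WRITE case. */
static void ex_write_test(volatile uint32_t *waddr, size_t words, uint32_t count)
{
    for (size_t ii = 0; ii < words; ii++)
        *waddr++ = count++;
}

/* Read every word and accumulate a sum, as in the TEST_READ case. */
static uint64_t ex_read_test(const volatile uint32_t *raddr, size_t words)
{
    uint64_t sum = 0;
    for (size_t ii = 0; ii < words; ii++)
        sum += *raddr++;
    return sum;
}
```

Each iteration issues a single 32-bit access, so whether those accesses are merged into bursts depends on how the region is mapped, not on the loop itself.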
question
Are there any user space ways of creating direct burst transactions to / from memory? If not, relevant linux kernel links would be appreciated.
You definitely need to map the region as bufferable in order to maximize transfer speeds. You might need to use a different device driver than /dev/mem.
It is easier to control transfers from the programmable logic side if you use DMA to the Zynq host memory. On Zynq, I found I needed 8 maximum-length read requests in flight at a time to maximize throughput on a link.
If you need cache coherence with user space, you will need to use the ACP port so that the processor's cache will snoop on the memory writes from the programmable logic (PL).

Theoretical question: Fetching a 32-bit variable from 128 bit addressable memory

Suppose I have a processor running the Linux kernel, connected to a memory by a 128-bit bus. The memory is addressed so that each address returns the full 128 bits of data over the bus to the processor.
If I do a 32-bit assignment in the user space:
int a = *p
where p points to an address in that memory. That address will return the full 128 bits of data over the bus to the processor.
Is the kernel able to translate this kind of access from user space? What would happen?
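As an illustration of what happens to such a load, the sketch below models one 128-bit memory word as four 32-bit lanes and picks the lane that a byte address refers to. On real systems this selection is normally done in hardware (memory controller or cache logic) rather than by the kernel. The type and function names are hypothetical, as are the assumptions that the CPU uses byte addresses and that lane 0 holds the lowest address.

```c
#include <stdint.h>

/* One 128-bit memory word, modeled as four 32-bit lanes (lane 0 = lowest address). */
typedef struct { uint32_t lane[4]; } ex_word128;

/* Index of the 128-bit word that contains byte address addr. */
static uint64_t ex_word_index(uint64_t addr) { return addr / 16; }

/* Which 32-bit lane inside that word a 4-byte-aligned address selects. */
static unsigned ex_lane_index(uint64_t addr) { return (unsigned)((addr % 16) / 4); }

/* Model of `int a = *p;`: fetch the whole 128-bit word, keep one lane. */
static uint32_t ex_load32(const ex_word128 *mem, uint64_t addr)
{
    return mem[ex_word_index(addr)].lane[ex_lane_index(addr)];
}
```

In this model, byte address 0x18 falls in word 1, lane 2: the full 128-bit word is transferred, and only those 32 bits reach the destination register.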

CUDA allocate memory in __device__ function

Is there a way in CUDA to allocate memory dynamically in device-side functions ?
I could not find any examples of doing this.
From the CUDA C Programming manual:
B.15 Dynamic Global Memory Allocation
void* malloc(size_t size);
void free(void* ptr);
allocate and free memory dynamically from a fixed-size heap in global memory.
The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.
The CUDA in-kernel free() function deallocates the memory pointed to by ptr, which must have been returned by a previous call to malloc(). If ptr is NULL, the call to free() is ignored. Repeated calls to free() with the same ptr has undefined behavior.
The memory allocated by a given CUDA thread via malloc() remains allocated for the lifetime of the CUDA context, or until it is explicitly released by a call to free(). It can be used by any other CUDA threads even from subsequent kernel launches. Any CUDA thread may free memory allocated by another thread, but care should be taken to ensure that the same pointer is not freed more than once.
According to http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf you should be able to use malloc() and free() in a device function.
Page 122
B.15 Dynamic Global Memory Allocation
void* malloc(size_t size);
void free(void* ptr);
allocate and free memory dynamically from a fixed-size heap in global memory.
The example given in the manual.
#include <stdio.h>
__global__ void mallocTest()
{
    char* ptr = (char*)malloc(123);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}
int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    // (Newer toolkits name this call cudaDeviceSetLimit.)
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    // (Newer toolkits name this call cudaDeviceSynchronize.)
    cudaThreadSynchronize();
    return 0;
}
You need the compiler parameter -arch=sm_20 and a card with compute capability 2.0 or higher.
