How to implement MAP_HUGETLB in a character device driver? - linux-kernel

I have a character driver exposing a character device file under /dev. I'd like to use the huge pages when I map some menory.
MAP_HUGETLB seems to be only available when grouped with MAP_ANONYMOUS or with transparent huge-page but am not interested in it.
mmap(...,MAP_HUGETLB|MAP_ANONYMOUS,..., -1, 0);
How can we implement the huge-pages feature for a character device? Is it already done somewhere?
I didn't find an example the kernel's tree.

This isn't possible. You can only mmap files with MAP_HUGETLB if they reside in a hugetlbfs filesystem. Since /proc is a procfs filesystem, you have no way of mapping those files through huge pages.
You can also see this from the checks performed in mmap by the kernel:
/* ... */
if (!(flags & MAP_ANONYMOUS)) { // <== File-backed mapping?
audit_mmap_fd(fd, flags);
file = fget(fd);
if (!file)
return -EBADF;
if (is_file_hugepages(file)) { // <== Check that FS is hugetlbfs
len = ALIGN(len, huge_page_size(hstate_file(file)));
} else if (unlikely(flags & MAP_HUGETLB)) { // <== If not, MAP_HUGETLB isn't allowed
retval = -EINVAL;
goto out_fput;
}
} else if (flags & MAP_HUGETLB) { // <== Anonymous MAP_HUGETLB mapping?
/* ... */
See also:
How to use Linux hugetlbfs for shared memory maps of files?
This answer providing some detailed explanations and examples

Related

MMAP buffer kernel writes are not seen by user space

i have a kernel driver which shares a buffer with the user space layer.
Everything seemed to work fine in my VM prototype (Ubuntu, Kernel 5.4) but when i moved my code to the target (same kernel but this is an embedded distro) I can clearly see that Kernel writes to the buffer (using memcpy, or memset) are not reflected in the User space side of the buffer.
Note that, i use direct buffer accesses on both sides. There is no concurrency issue, as the Kernel writes to, then the user space reads from.
I ended up believing this is a cache issue ... as the same code works perfectly in my VM.
The buffer size is 4 * PAGE_SIZE.
It is allocated as follows:
int _size = (SFP_BUFFER_SIZE + (PAGE_SIZE-1)) & ~(PAGE_SIZE-1);
input_buffer = (char*) kzalloc (_size, GFP_KERNEL); // aligned on page boundary
if (!input_buffer) {
dev_dbg(&dev, "open/ENOMEM (input_buffer)\n");
status = -ENOMEM;
goto err_all
When mmap'ing, i used the following code pattern:
vma->vm_ops = &fpgadrv_vm_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
pfn = virt_to_phys((void*)(input_buffer)) >> PAGE_SHIFT;
if (remap_pfn_range (vma, vma->vm_start, pfn, size, vma->vm_page_prot))
{
printk(KERN_DEBUG "remap page range failed\n");
return -EAGAIN;
}
User space code, and kernel code user memcpy to update the buffer. Note also that I cannot use write/read entry points, as they are already used for very specific operations.
The user code is calling mmap as follows:
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, device_fd, 0);
if (buf == MAP_FAILED)
{
perror("USERDRV:cannot mmap");
return -1; // for testing, ignore the return code and continue
}
and upon IOCTL call, the kernel would fill up the mmap buffer as follows:
case IOCTL_RESET:
printk(KERN_DEBUG "FPGADRV: IOCTL RESET");
// reset the buffer (zero + put back the signature)
memset(input_buffer, 0xA5, SFP_BUFFER_SIZE);
memcpy((void*)(input_buffer), (void*)signature, 10);
break;
Is there something more i should do to make sure the pages are not cached (assuming this is the cause of my pb) ?
Thanks,
Jacques

Simulating (lazy) NAND memory on Windows

I'm running a firmware simulation in a DLL which has simulated NAND (256MB or 1GB). I want to avoid allocating memory for this on the heap and instead allocate using virtual memory.
The memory initially needs to be cleared to 0xFF (like NAND is). However I don't want to pay for that initialization (nor commit un-accessed pages). So ideally it should only allocate upon access. And I do not need to retain the data following exit of the simulation.
Initial ideas are
VirtualAlloc. Not sure but thinking perhaps could use guard page and then trap the exception on first access. Not sure its ideal that a DLL handles such SEH exceptions? Or is there a better way?
Create a big file that's initialized to 0xFF. Then map view of file with copy-on-write.
Anyone know if it is possible to create a file with a callback for providing the initial data?
Think probably 1) the way to go but wondering if that's really the best option.
Edit:
3) I've come up with another method that can avoid exception handler and also avoids creating a huge file:
Create a file that is same size as dwAllocationGranularity (64KiB typically). Fill with 0xFF. Then create multiple copy-on-write views of that in contiguous memory using MapViewOfFileEx + FILE_MAP_COPY (after an initial VirtualAlloc/VirtualFree to get a suitable base address that we can hope to allocate juxtapositioned views). Need to test this a bit more fully - slight concern about potential thread races.. I'm ony actually using a single thread but the CRT does start a few too.
This means that any code that only reads the virtual NAND also does not result in all pages getting committed.
yes, basically 1 is best solution. only i be do next changes - use VEH instead SEH - SEH handler will be called only if you access memory inside it, when in case VEH - access can be ai any context and thread. and instead use guard page, i be initial only reserve region of memory without real allocation. so any access to memory region lead to exception, you handle it in VEH - commit memory and fill with 0xFF pattern. demo code
PVOID g_NandBegin;
SIZE_T g_NandSize = 0x1000000;
LONG NTAPI Vex(::PEXCEPTION_POINTERS ExceptionInfo)
{
::PEXCEPTION_RECORD ExceptionRecord = ExceptionInfo->ExceptionRecord;
if (ExceptionRecord->ExceptionCode == STATUS_ACCESS_VIOLATION &&
ExceptionRecord->NumberParameters > 1)
{
PVOID pv = (PVOID)ExceptionRecord->ExceptionInformation[1];
if ((ULONG_PTR)pv - (ULONG_PTR)g_NandBegin < g_NandSize)
{
SIZE_T RegionSize = 1;
if (0 <= NtAllocateVirtualMemory(NtCurrentProcess(), &pv, 0, &RegionSize, MEM_COMMIT, PAGE_READWRITE))
{
RtlFillMemoryUlong(pv, RegionSize, MAXULONG);
return EXCEPTION_CONTINUE_EXECUTION;
}
}
}
return EXCEPTION_CONTINUE_SEARCH;
}
void dc()
{
if (PVOID pv = AddVectoredExceptionHandler(TRUE, Vex))
{
if (g_NandBegin = VirtualAlloc(0, g_NandSize, MEM_RESERVE, PAGE_READWRITE))
{
ULONG seed = ~GetTickCount();
int n = 0x100;
do
{
if (*(UCHAR*)((PBYTE)g_NandBegin + (((ULONG64)RtlRandomEx(&seed) * g_NandSize) >> 32)) != 0xFF)
{
__debugbreak();
}
} while (--n);
VirtualFree(g_NandBegin, 0, MEM_RELEASE);
}
RemoveVectoredExceptionHandler(pv);
}
}

How to work with reserved CMA memory?

I would like to allocate piece of physically contiguous reserved memory (in predefined physical addresses) for my device with DMA support.
As I see CMA has three options:
1. To reserve memory via kernel config file. 2. To reserve memory via kernel cmdline. 3. To reserve memory via device-tree memory node.
In the first case: size and number of areas could be reserved.
CONFIG_DMA_CMA=y
CONFIG_CMA_AREAS=7
CONFIG_CMA_SIZE_MBYTES=8
So I could use:
start_cma_virt = dma_alloc_coherent(dev->cmadev, (size_t)size_cma, &start_cma_dma, GFP_KERNEL);
in my driver to allocate contiguous memory. I could use it max 7 times and it will be possible allocate up to 8M. But unfortunately
dma_contiguous_reserve(min(arm_dma_limit, arm_lowmem_limit));
from arch/arm/mm/init.c:
void __init arm_memblock_init(struct meminfo *mi,const struct machine_desc *mdesc)
it is impossible to set predefined physical addresses for contiguous allocation.
Of Course I could use kernel cmdline:
mem=cma=cmadevlabel=8M#32M cma_map=mydevname=cmadevlabel
//struct device *dev = cmadev->dev; /*dev->name is mydevname*/
After that dma_alloc_coherent() should alloc memory in physical memory area from 32M + 8M (0x2000000 + 0x800000) up to 0x27FFFFF.
But unfortunately I have problem with this solution. Maybe my cmdline has error?
Next one try was device tree implementation.
cmadev_region: mycma {
/*no-map;*/ /*DMA coherent memory*/
/*reusable;*/
reg = <0x02000000 0x00100000>;
};
And phandle in some node:
memory-region = <&cmadev_region>;
As I saw in kernel usual it should be used like:
of_find_node_by_name(); //find needed node
of_parse_phandle(); //resolve a phandle property to a device_node pointer
of_get_address(); //get DT __be32 physical addresses
of_translate_address(); //DT represent local (bus, device) addresses so translate it to CPU physical addresses
request_mem_region(); //reserve IOMAP memory (cat /proc/iomem)
ioremap(); //alloc entry in page table for reserved memory and return kernel logical addresses.
But I want use DMA via (as I know only one external API function dma_alloc_coherent) dma_alloc_coherent() instead IO-MAP ioremap(). But how call
start_cma_virt = dma_alloc_coherent(dev->cmadev, (size_t)size_cma, &start_cma_dma, GFP_KERNEL);
associate memory from device-tree (reg = <0x02000000 0x00100000>;) to dev->cmadev ? In case with cmdline it is clear it has device name and addresses region.
Does reserved memory after call of_parse_phandle() automatically should be booked for your special driver (which parse DT). And next call dma_alloc_coherent will allocate dma area inside memory from cmadev_region: mycma?
To use dma_alloc_coherent() on reserved memory node, you need to declare that area as dma_coherent. You can do some thing like:
In dt:
cmadev_region: mycma {
compatible = "compatible-name"
no-map;
reg = <0x02000000 0x00100000>;
};
In your driver:
struct device *cma_dev;
static int rmem_dma_device_init(struct reserved_mem *rmem, struct device *dev)
{
int ret;
if (!mem) {
ret = dma_declare_coherent_memory(cma_dev, rmem->base, rmem->base,
rmem->size, DMA_MEMORY_EXCLUSIVE);
if (ret) {
pr_err("Error");
return ret;
}
}
return 0;
}
static void rmem_dma_device_release(struct reserved_mem *rmem,
struct device *dev)
{
if (dev)
dev->dma_mem = NULL;
}
static const struct reserved_mem_ops rmem_dma_ops = {
.device_init = rmem_dma_device_init,
.device_release = rmem_dma_device_release,
};
int __init cma_setup(struct reserved_mem *rmem)
{
rmem->ops = &rmem_dma_ops;
return 0;
}
RESERVEDMEM_OF_DECLARE(some-name, "compatible-name", cma_setup);
Now on this cma_dev you can perform dma_alloc_coherent and get memory.

Interrupt performance on linux kernel with RT patches - should be better?

I have bumped into a bit inconsistent IRQ/ISR performance on Freescales imx.233 running linux kernel (3.8.13) with CONFIG_PREEMPT_RT patches.
I am little bit surprised why this processor (ARM9, 454mhz) is unable to keep up even with 74kHz IRQ requests.. ?
In my kernel config I have set following flags:
CONFIG_TINY_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_RT_BASE=y
CONFIG_HAVE_PREEMPT_LAZY=y
CONFIG_PREEMPT_LAZY=y
CONFIG_PREEMPT_RT_FULL=y
CONFIG_PREEMPT_COUNT=y
CONFIG_DEBUG_PREEMPT=y
On the system there is basically nothing running (created by buildroot), and I set PWM to generate a pulse of 74kHz, that serves as interrupt.
Then in the ISR, I just trigger another GPIO output pin, and check the output.
What I find is that sometimes I miss an interrupt -
You can see the missed interrupt here:
And also the the triggering of output pin seems to be a bit inconsistent, the output pin is triggered usually within "5% window", that might still be acceptable. But I worry, that when I start implementing data transfer logic, instead of just triggering the pin, I might run into further problems...
My simple driver code looks like this:
#needed includes
uint16_t INPUT_IRQ = 39;
uint16_t OUTPUT_GPIO = 38;
struct test_device *device;
//Prototypes
void irqtest_exit(void);
int irqtest_init(void);
void free_device(void);
//Default functions
module_init(irqtest_init);
module_exit(irqtest_exit);
//triggering flag
uint16_t pulse = 0x1;
irqreturn_t irq_handle_function(int irq, void *device_id)
{
pulse = !pulse;
gpio_set_value(OUTPUT_GPIO, pulse);
return IRQ_HANDLED;
}
struct test_device {
int huuhaa;
};
void free_device() {
if (device)
kfree(device);
}
int irqtest_init(void) {
int result = 0;
device = kmalloc(sizeof *device, GFP_KERNEL);
device->huuhaa = 10;
printk("IRB/irqtest_init: Inserting IRQ module\n");
printk("IRB/irqtest_init: Requesting GPIO (%d)\n", INPUT_IRQ);
result = gpio_request_one(INPUT_IRQ, GPIOF_IN, "PWM input");
if (result != 0) {
free_device();
printk("IRB/irqtest_init: Failed to set GPIO (%d) as input.. exiting\n", INPUT_IRQ);
return -EINVAL;
}
result = gpio_request_one(OUTPUT_GPIO, GPIOF_OUT_INIT_LOW , "IR OUTPUT");
if (result != 0) {
free_device();
printk("IRB/irqtest_init: Failed to set GPIO (%d) as output.. exiting\n", OUTPUT_GPIO);
return -EINVAL;
}
//Set our desired interrupt line as input
result = gpio_direction_input(INPUT_IRQ);
if (result != 0) {
printk("IRB/irqtest_init: Failed to set IRQ as input.. exiting\n");
free_device();
return -EINVAL;
}
//Set flags for our interrupt, guessing here..
irq_flags |= IRQF_NO_THREAD;
irq_flags |= IRQF_NOBALANCING;
irq_flags |= IRQF_TRIGGER_RISING;
irq_flags |= IRQF_NO_SOFTIRQ_CALL;
//register interrupt
result = request_irq(gpio_to_irq(INPUT_IRQ), irq_handle_function, irq_flags, "irq testing", device);
if (result != 0) {
printk("IRB/irqtest_init: Failed to reserve GPIO 38\n");
return -EINVAL;
}
printk("IRB/irqtest_init: insert success\n");
return 0;
}
void irqtest_exit(void) {
if (device)
kfree(device);
gpio_free(INPUT_IRQ);
gpio_free(OUTPUT_GPIO);
printk("IRB/irqtest_exit: Removing irqtest module\n");
}
int irqtest_open(struct inode *inode, struct file *filp) {return 0;}
int irqtest_release(struct inode *inode, struct file *filp) {return 0;}
In the system, I have following interrupts registered, after the driver is loaded:
# cat /proc/interrupts
CPU0
16: 36379 - MXS Timer Tick
17: 0 - mxs-spi
18: 2103 - mxs-dma
60: 0 gpio-mxs irq testing
118: 0 - mxs-spi
119: 0 - mxs-dma
120: 0 - RTC alarm
124: 0 - 8006c000.serial
127: 68050 - uart-pl011
128: 151 - ci13xxx_imx
Err: 0
I wonder if the flags I declare to my IRQ are good ? I noticed that with this configuration, I can no longer reach console, so kernel seems totally consumed with servicing this 74kHz trigger now.. this can't be right ?
I suppose it's not a big deal for me since this is only during data transfer, but still I feel I'm doing something wrong..
Also, I wonder if it would be more efficient to map the registers with ioremap, and trigger the output with direct memory writes ?
Is there some way I could increase the priority of the interrupt even higher ? Or could I somehow lock the kernel for the duration of the data transfer (~400ms), and generate somehow else my timing for the output ?
Edit: Forgot to add /proc/interrupts output to the question...
What you experience here is interrupt jitter. This is to be expected on Linux, because the kernel regularly disables the interrupts for various tasks (entering a spinlock, handling an interrupt, etc.).
This will happen, regardless wether you have PREEMPT_RT or not, so expecting to generate 74kHz signal with regular interrupts is pretty much unrealistic.
Now, ARM has higher priority interrupts called FIQs, that will never be masked or disabled.
Linux doesn't use FIQ, and is not built to deal with the fact that an FIQ could be used, so you won't be able to use the generic kernel framework.
From Linux driver development point of view however, it's not really different as long as you keep this in mind: you have to write a handler, and associate it to an IRQ. You'll also have to poke into the interrupt controller to make it generate a FIQ for the interrupt you want to use (the details on how to change it are platform-dependant. Some platforms have functions to do that (like imx25 and mxc_set_irq_fiq), some others don't. imx23/28 don't, so you'll have to do it by hand).
The only thing that the functions to setup a fiq handler only work with a assembly-written handler, so you'll have to rewrite your handler in assembly (with your current code, it should be trivial though).
You can grab additional details to the blog post Alexandre posted (http://free-electrons.com/blog/fiq-handlers-in-the-arm-linux-kernel/), where you'll find working code, samples, and explanations on how it all works together.
You can have a look at what my colleague Maxime Ripard did using an FIQ on a similar SoC (i.mx28) :
http://free-electrons.com/blog/fiq-handlers-in-the-arm-linux-kernel/
Try this flags:
int irq_flags;
...
irq_flags = IRQF_TRIGGER_RISING | IRQF_EARLY_RESUME
I had a kernel 3.8.11 and can't find IRQF_NO_SOFTIRQ_CALL define. It's only for 3.8.13?
Also I didn't see irq_flags define. Where is it?

corrupted pointer in 'net_device'

the device driver I'm working on is implementing a virtual device. The logic
is as follows:
static struct net_device_ops virt_net_ops = {
.ndo_init = virt_net_init,
.ndo_open = virt_net_open,
.ndo_stop = virt_net_stop,
.ndo_do_ioctl = virt_net_ioctl,
.ndo_get_stats = virt_net_get_stats,
.ndo_start_xmit = virt_net_start_xmit,
};
...
struct net_device *dev;
struct my_dev *virt;
dev = alloc_netdev(..);
/* check for NULL */
virt = netdev_priv(dev);
dev->netdev_ops = &virt_net_ops;
SET_ETHTOOL_OPS(dev, &virt_ethtool_ops);
dev_net_set(dev, net);
virt->magic = MY_VIRT_DEV_MAGIC;
ret = register_netdev(dev);
if (ret) {
printk("register_netdev failed\n");
free_netdev(dev);
return ret;
}
...
What happens is that somewhere somehow the pointer net_device_ops in
'net_dev' gets corrupted, i.e.
1) create the device the first time (allocated net_dev, init the fields
including net_device_ops,which is
initialized with a static structure containing function pointers), register
the device with the kernel invoking register_netdev() - OK
2) attempt to create the device with the same name again, repeat the above
steps, call register_netdev() which will return negative and we
free_netdev(dev) and return error to the caller.
And between these two events the pointer to net_device_ops has changed,
although nowhere in the code it is done explicitly except the initialization
phase.
The kernel version is 2.6.31.8, platform MIPS. Communication channel between the user space and the kernel is implemented via netlink sockets.
Could anybody suggest what possibly can go wrong?
Appreciate any advices, thanks.
Mark
"The bug is somewhere else. "
The second device should not interact with the existing one. If you register_netdev with an existing name, nevertheless the ndo_init virtual function is called first before the condition is detected and -EEXIST is returned. Maybe your init function does something nasty involving some global variables. (For example, does the code assume there is one device, and stash a global pointer to it during initialization?)

Resources