Provide several kernel buffers through mmap - linux-kernel

I have a kernel driver which allocates several buffers in kernel space (physically contiguous, aligned to page boundaries, and consisting of an integral number of pages).
Next, I need to make my driver able to mmap some of these buffers to userspace (one buffer per mmap() call, of course). The driver registers a single character device for that purpose.
The userspace program must be able to tell the kernel which buffer it wants to mmap (for example, by specifying its index or unique ID, or a physical address previously resolved through ioctl()).
I want to do so by using mmap()'s offset parameter, for example (from userspace):
mapped_ptr = mmap(NULL, buf_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (MAGIC + buffer_id) * PAGE_SIZE);
Where "MAGIC" is some magic number, and buffer_id is the buffer ID which I want to mmap.
Next, in the kernel part there will be something like this:
static int my_dev_mmap(struct file *filp, struct vm_area_struct *vma)
{
	int bufferID = vma->vm_pgoff - MAGIC;

	/*
	 * Convert bufferID to a PFN by looking through the driver's
	 * buffer descriptors.
	 * Check length = vma->vm_end - vma->vm_start.
	 * Call remap_pfn_range().
	 */
}
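For illustration, a minimal sketch of how that body might be filled in, assuming a hypothetical my_buffers[] descriptor table (buf_count entries, each with a pfn and a len; these names are invented for the sketch):

static int my_dev_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long len = vma->vm_end - vma->vm_start;
	long id = (long)vma->vm_pgoff - MAGIC;

	/* Reject IDs outside the (hypothetical) descriptor table. */
	if (id < 0 || id >= buf_count)
		return -EINVAL;

	/* Refuse mappings longer than the underlying buffer. */
	if (len > my_buffers[id].len)
		return -EINVAL;

	/* Map the physically contiguous buffer into the VMA. */
	return remap_pfn_range(vma, vma->vm_start, my_buffers[id].pfn,
			       len, vma->vm_page_prot);
}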
But this seems like a dirty way to do it, because the "offset" in mmap() is not supposed to specify an index or identifier; its role is to give the number of skipped bytes (or pages) from the beginning of the mmap-ed device (or file) memory (which is supposed to be contiguous, right?).
However, I've already seen some drivers in mainline which use "offset" to distinguish between mmap-ed buffers.
Are there any alternative solutions to this?
P.S.
I need all this just because I'm dealing with an unusual SoC graphics controller, which can only operate on physically contiguous, 8-byte-aligned memory buffers. So I can only allocate such buffers in kernel space and pass them to user space via mmap().
Most of the controller programming (composing instruction batches and pushing them to the kernel driver) is performed in user space.
Also, I can't just allocate a single big chunk of physically contiguous memory, because it would then need to be really big (e.g., 16+ MiB) and alloc_pages_exact() would fail.

I don't see anything wrong with using the offset to pass the index in from userspace to your driver. If it bugs you, then just look at your driver as assembling a large buffer out of individual pages that it wants to present to userspace as virtually contiguous, so that the offset really is an offset into this buffer. But really in my opinion there's nothing wrong with doing things this way.
Another alternative, if you can use kernel 3.5 or newer, might be to use the "Contiguous Memory Allocator" (CMA) -- look at <linux/dma-contiguous.h> and drivers/base/dma-contiguous.c for more information. There's also https://lwn.net/Articles/486301/ as a reference but I don't know how much (if anything) changed between that article and getting the code merged into mainline.
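With CMA in place, the allocation itself typically goes through the regular DMA API. A sketch, assuming dev is the struct device of the graphics controller:

#include <linux/dma-mapping.h>

dma_addr_t dma_handle;
void *vaddr;

/* With CONFIG_DMA_CMA, large coherent allocations like this one are
   typically satisfied from the contiguous region reserved at boot. */
vaddr = dma_alloc_coherent(dev, 16 * 1024 * 1024, &dma_handle, GFP_KERNEL);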

Finally, I've chosen to mmap exactly one buffer per opened device file descriptor (struct file in the kernel) and to implement control through ioctl(): one IOCTL to allocate a new buffer, one to attach to an already allocated buffer with a known ID, and another to get information about a buffer.
Usually, userspace will mmap() about 10-20 buffers at the same time, so this is a nice and clean solution for that case.
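For concreteness, the command set could be declared along these lines (a sketch; all the names and the 'M' magic are invented for the example):

#include <linux/ioctl.h>
#include <linux/types.h>

struct my_buf_info {
	__u32 id;    /* buffer ID */
	__u64 size;  /* buffer length in bytes */
};

#define MY_IOC_MAGIC  'M'
#define MY_IOC_ALLOC  _IOWR(MY_IOC_MAGIC, 0, struct my_buf_info)
#define MY_IOC_ATTACH _IOW(MY_IOC_MAGIC, 1, __u32)
#define MY_IOC_INFO   _IOR(MY_IOC_MAGIC, 2, struct my_buf_info)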

Related

A heap manager for C/Pascal that automatically fills freed memory with zero bytes

What do you think about an option to fill freed (not actually used) pages with zero bytes? This may improve performance under Windows, and also under VMware and other virtual machine environments. For example, VMware and Hyper-V calculate a hash of memory pages and, if the contents are the same, mark the page as "shared" inside a virtual machine and between virtual machines on the same host, until the page is modified. This effectively decreases memory consumption. Windows does something similar: it handles zero pages differently, treating them as free.
We could have a heap manager that automatically fills memory with zeros when we call FreeMem/ReallocMem. As an alternative, we could have a function that zeroizes empty memory on demand, i.e. only when it is explicitly called. Of course, this function would have to be thread-safe.
The drawback of filling memory with zeros is touching the memory, which might already have been paged out, thus triggering page faults. Besides that, memory store operations are slow, so our program would be slower, albeit to an unknown extent (maybe negligibly).
If we manage to fill 4 KiB pages completely with zeros, the hypervisor or Windows will explicitly mark them as zero pages. But even partial zeroizing may be beneficial, since the hypervisor may compress pages using LZ or similar algorithms to save physical memory.
I just want to know your opinion on whether the benefits of having the heap manager fill emptied heap memory with zero bytes outweigh the disadvantages of such a technique.
Is zeroizing worth its price, when what we buy is reduced physical memory consumption?
When you have a page whose contents you no longer care about but you still want to keep it allocated, you can call VirtualAlloc (and variants) and pass the MEM_RESET flag.
From VirtualAlloc on MSDN:
MEM_RESET
Indicates that data in the memory range specified by lpAddress and
dwSize is no longer of interest. The pages should not be read from or
written to the paging file. However, the memory block will be used
again later, so it should not be decommitted. This value cannot be
used with any other value.
Using this value does not guarantee that
the range operated on with MEM_RESET will contain zeros. If you want
the range to contain zeros, decommit the memory and then recommit it.
This gives the best of both worlds - you don't have the cost of zeroing the memory, and the system does not have the cost of paging it back in. You get to take advantage of the well-tuned memory manager which already has a zero-pool.
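In code, the call is simply a VirtualAlloc over the already-committed region (a sketch; note that with MEM_RESET the protection argument is ignored but must still be a valid value):

#include <windows.h>

/* Mark an already-committed region as "contents no longer needed".
   The pages stay committed, but won't be written to the pagefile. */
void reset_region(void *base, SIZE_T len)
{
    VirtualAlloc(base, len, MEM_RESET, PAGE_NOACCESS);
}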
Similar functionality also exists on Linux via madvise() with MADV_FREE (or MADV_DONTNEED; POSIX has posix_madvise() with POSIX_MADV_DONTNEED). Glibc uses this in the implementation of its heap:
/*
 * Call stack:
 *   int shrink_heap (heap_info *h, long diff)
 *   int heap_trim (heap_info *heap, size_t pad)              at arena.c:660
 *   void _int_free (mstate av, mchunkptr p, int have_lock)   at malloc.c:4097
 *   void __libc_free (void *mem)                             at malloc.c:2948
 *   void free (void *mem)
 */
static int
shrink_heap (heap_info *h, long diff)
{
  long new_size;

  new_size = (long) h->size - diff;
  /* ... snip ... */
  __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
  /* ... snip ... */
  h->size = new_size;
  return 0;
}
If your heap is in user space this will never work. The kernel can only trust itself, not user space. If the kernel zeros a page, it can treat it as zero. If user space says it zeroed a page, the kernel would still have to check that. It might just as well zero it. One thing user space can do is to discard pages. Which marks them as "don't care". Then a kernel can treat them as zero. But manually zeroing pages in user space is futile.
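For completeness, discarding pages from user space is a one-liner (a Linux-specific sketch; after MADV_DONTNEED, an anonymous private mapping reads back as zeros on the next access):

#include <sys/mman.h>

static void discard(void *buf, size_t len)
{
    /* Drop the page contents; the next access faults in fresh zero
       pages, costing no zeroing work in user space. */
    madvise(buf, len, MADV_DONTNEED);
}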

Strange behaviour using mmap

I'm using the Angstrom embedded Linux kernel v2.6.37, based on the Technexion distribution.
DM3730 SoC, TDM3730 module, custom baseboard.
CodeSourcery toolchain v2010.09-50.
Here is the dataflow in my system:
http://i.stack.imgur.com/kPhKw.png
The FPGA generates incrementing data, and the kernel reads it via GPMC DMA. GPMC pack size = 512 data samples. Buffer size = 61440 32-bit samples (= 60 RAM pages).
The DMA buffer is allocated with dma_alloc_coherent() and mapped to userspace by an mmap() call. The user application reads data directly from the DMA buffer and saves it to NAND using fwrite(). The user reads data 4096 samples at a time.
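Presumably the allocation on the kernel side looks something like this (my sketch, not the asker's actual code; dev stands for the GPMC device):

#include <linux/dma-mapping.h>

dma_addr_t dma_handle;
u32 *dma_buf;

/* 61440 32-bit samples = 60 pages of coherent DMA memory. */
dma_buf = dma_alloc_coherent(dev, 61440 * sizeof(u32),
                             &dma_handle, GFP_KERNEL);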
And what do I see in my file? http://i.stack.imgur.com/etzo0.png
The red line marks the first border of the ring buffer. Oops! Small chunks (~16 samples) of stale data start to show up after the border. Their values are exactly the "old" values of the corresponding buffer positions. But WHY? 16 samples is much less than both the DMA pack size and the user read pack size, so there cannot be a pointer mismatch.
I guess some mmap() feature is hiding somewhere. I have tried different flags for mmap(), such as MAP_LOCKED, MAP_POPULATE and MAP_NONBLOCK, with no success. I completely fail to understand this behaviour :(
P.S. When I use copy_to_user() from the kernel instead of mmap() and zero-copy access, there is no such behaviour.

Why does memset make the virtual memory so large

I have a process that does a lot of lithography calculation, so I used mmap to allocate memory for a memory pool. When the process needs a large chunk of memory, I mmap a chunk; after it is used, I put it into the memory pool, and if a chunk of the same size is needed again, it is taken from the pool directly instead of being mmap-ed again. (I do not allocate all the needed memory and put it into the pool at the beginning of the process.) Between the mmap calls there are some allocations that do not use mmap, such as malloc() or new.
Now the question is:
If I use memset() to set all of a chunk's data to ZERO before putting it into the memory pool, the process uses a huge amount of virtual memory, as shown below (the format is "mmap(size) = virtual address"):
mmap(4198400)=0x2aaab4007000
mmap(4198400)=0x2aaab940c000
mmap(8392704)=0x2aaabd80f000
mmap(8392704)=0x2aaad6883000
mmap(67112960)=0x2aaad7084000
mmap(8392704)=0x2aaadb085000
mmap(2101248)=0x2aaadb886000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae028b000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae108f000
mmap(8392704)=0x2aaae1890000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaeca9c000
mmap(8392704)=0x2aaaec29b000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
mmap(8392704)=0x2aaafd6a7000
mmap(2101248)=0x2aacc5f8c000
The last mmap address minus the first: 0x2aacc5f8c000 - 0x2aaab4007000 = 8.28 GiB.
But if I don't call memset before putting chunks into the memory pool:
mmap(4198400)=0x2aaab4007000
mmap(8392704)=0x2aaab940c000
mmap(8392704)=0x2aaad2480000
mmap(67112960)=0x2aaad2c81000
mmap(2101248)=0x2aaad6c82000
mmap(4198400)=0x2aaad6e83000
mmap(8392704)=0x2aaadb288000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae0a8c000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae1890000
mmap(8392704)=0x2aaae108f000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaec29b000
mmap(8392704)=0x2aaaec49c000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
The last mmap address minus the first: 0x2aaaed49e000 - 0x2aaab4007000 = 916 MiB.
So the first process runs out of memory and is killed.
In this process, a mmap-ed memory chunk may not be fully used, or even used at all, although it has been allocated. For example, before calibration the process mmaps 67112960 bytes (64 MiB) and then either does not use the region at all (no reads or writes in it) or uses just the first 2 MB, before putting it into the memory pool.
I know that mmap just returns a virtual address and that the physical memory is allocated lazily, when these addresses are first read or written.
But what confuses me is: why does the virtual address range grow so much? I am using CentOS 5.3, kernel version 2.6.18, and I tried this process with both libhoard and glibc (ptmalloc), with the same behaviour.
Has anyone met the same issue before? What is the possible root cause?
Thanks.
VMAs (virtual memory areas, a.k.a. memory mappings) do not need to be contiguous. Your first example uses ~256 MiB, the second ~246 MiB.
Common malloc() implementations use mmap() automatically for large allocations (usually larger than 64 KiB), freeing the corresponding chunks with munmap(). So you do not need to mmap() manually for large allocations; your malloc() library will take care of that.
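In glibc this threshold is the M_MMAP_THRESHOLD parameter, tunable with mallopt(); a one-line sketch:

#include <malloc.h>

/* Ask glibc's ptmalloc to satisfy allocations of 1 MiB and larger via mmap(). */
mallopt(M_MMAP_THRESHOLD, 1 << 20);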
When you mmap(), the kernel returns a copy-on-write (COW) mapping of a special zero page, so it doesn't actually allocate memory until the mapping is written to. Your zeroing forces the memory to really be allocated; it is better to just return the chunk to the allocator and request a new chunk when you need it.
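A small program illustrates the effect (a sketch; watch the process RSS in /proc/<pid>/status before and after the memset):

#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 << 20;  /* 64 MiB */

    /* Anonymous mapping: backed by the shared zero page, no RSS cost yet. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Writing (even zeros) breaks COW and commits all 64 MiB of pages. */
    memset(p, 0, len);

    munmap(p, len);
    return 0;
}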
Conclusion: don't write your own memory management unless the system one has proven inadequate for your needs, and then use your own memory management only when you have proven it noticeably better for your needs under a real-life load.

Precisely determining the unused RAM available in Win32

I use this routine to fill unused RAM with zeros.
It causes crashes on some computers, and this adjustment is coarse:
size = size - (size / 10);
Is there a more accurate way to determine the amount of unused RAM to be filled with zeroes?
DWORDLONG getTotalSystemMemory() {
    PROCESS_MEMORY_COUNTERS lMemInfo;
    BOOL success = GetProcessMemoryInfo(
        GetCurrentProcess(), &lMemInfo, sizeof(lMemInfo));

    MEMORYSTATUSEX statex;
    statex.dwLength = sizeof(statex);
    GlobalMemoryStatusEx(&statex);

    wprintf(L"Mem: %d\n", lMemInfo.WorkingSetSize);
    return statex.ullAvailPhys - lMemInfo.WorkingSetSize;
}

void Zero() {
    int size = getTotalSystemMemory(); //-(1024*140000)
    size = size - (size / 10);
    //if(size>1073741824) size=1073741824; //2^32-1
    wprintf(L"Mem: %d\n", size);
    BYTE* ar = new BYTE[size];
    RtlSecureZeroMemory(ar, size);
    delete[] ar;
}
This program does not do what you think it does. In fact, it is counterproductive. Fortunately, it is also unnecessary.
First of all, the program is unnecessary: Windows already has a thread whose sole job is to zero out free pages, uncreatively known as the zero page thread. This blog entry goes into quite a bit of detail on how it works. Therefore, the way to fill free memory with zeroes is to do nothing, because there is already somebody filling free memory with zeroes.
Second, the program does not do what you think it does because when an application allocates memory, the kernel makes sure that the memory is full of zeroes before giving it to the application. (If there are not enough pre-zeroed pages available, the kernel will zero out the pages right there.) Therefore, your program which writes out zeroes is just writing zeroes on top of zeroes.
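You can observe that guarantee directly with a small sketch:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Freshly committed private pages are guaranteed zero-initialized. */
    unsigned char *p = (unsigned char *)VirtualAlloc(
        NULL, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (p != NULL) {
        printf("%u %u\n", p[0], p[4095]);  /* prints "0 0" */
        VirtualFree(p, 0, MEM_RELEASE);
    }
    return 0;
}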
Third, the program is counterproductive because it is not limiting itself to memory that is free. It is zeroing out a big chunk of memory that might have been busy. This may force other applications to give up their active memory so that it can be given to you.
The program is also counterproductive because even if it manages only to grab free memory, it dirties the memory (by writing zeroes) before freeing it. Returning dirty pages to the kernel puts them on the "dirty free memory" list, which means that the zero page thread has to go and zero them out again. (Redundantly, in this case, but the kernel doesn't bother checking whether a freed page is full of zeros before zeroing it out. Checking whether a page is full of zeroes is about as expensive as just zeroing it out anyway.)
It is unclear what the purpose of your program is. Why does it matter that free memory is full of zeroes or not?

Linux block driver: merging bios

I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than using a request queue, as the device has no seek time. However, it still has per-transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (so 2 * 2k) and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. However on a write, the bios each have just one segment of 2 sectors and therefore the operations take a lot longer in total. What I would like to happen is to somehow cause the incoming bios to consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of the following (see the sketch after this list):
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist with sg_set_page()
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
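Put into code, those steps might look roughly like this (a sketch against the 3.x-era block API this question uses; struct mydev, its pdev member, and mydev_xfer_sg() are hypothetical stand-ins for the device-specific parts):

static void mydev_make_req(struct request_queue *q, struct bio *bio)
{
	struct mydev *mydev = q->queuedata;
	struct scatterlist sg[MYDEV_BLOCK_MAX_SEGS];
	struct bio_vec *bvec;
	int i, nsegs = 0, mapped, dir, err;

	/* 1. Determine the direction of the transfer. */
	dir = bio_data_dir(bio) == WRITE ? PCI_DMA_TODEVICE : PCI_DMA_FROMDEVICE;

	/* 2. One scatterlist entry per bio segment. */
	sg_init_table(sg, MYDEV_BLOCK_MAX_SEGS);
	bio_for_each_segment(bvec, bio, i)
		sg_set_page(&sg[nsegs++], bvec->bv_page,
			    bvec->bv_len, bvec->bv_offset);

	/* 3. Map the scatterlist for PCI DMA. */
	mapped = pci_map_sg(mydev->pdev, sg, nsegs, dir);

	/* 4-6. Hypothetical helper standing in for building the
	   device-specific SG descriptor and running the transaction. */
	err = mydev_xfer_sg(mydev, sg, mapped, bio->bi_sector, dir);

	/* 7. Unmap and complete. */
	pci_unmap_sg(mydev->pdev, sg, nsegs, dir);
	bio_endio(bio, err ? -EIO : 0);
}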
The request queue is set up like:
#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE 2048
blk_queue_make_request(mydev->queue, mydev_make_req);
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_flush(mydev->queue, 0);
blk_queue_segment_boundary(mydev->queue, -1UL);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_dma_alignment(mydev->queue, 0x7);
