Are there any memory restrictions on Linux Kernel Modules? - linux-kernel

Are there any restrictions on memory usage by a Linux kernel module, e.g. code segment size, amount of global memory, or anything else?

In 2.6.35, load_module() bails out if the length of the module to load exceeds 64 MB: http://lxr.linux.no/#linux+v2.6.35/kernel/module.c#L2118
vmalloc() is used to allocate space for the module -- this fails if you try to allocate more pages than are available in physical memory (which in turn will probably only be an issue on embedded systems with little RAM).
Furthermore, kzalloc() (and in turn, kmalloc()) are used. Depending on the allocator in use (SLAB, SLOB, SLUB), there may be restrictions as well. SLAB defines KMALLOC_MAX_SIZE, which is the maximum number of bytes you can allocate with a single call to kmalloc().
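As a rough illustration of the two allocators' limits (a minimal sketch, not taken from the kernel sources; the sizes are arbitrary choices):

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static int __init alloc_demo_init(void)
    {
            /* kmalloc() must find physically contiguous pages and is
             * capped by KMALLOC_MAX_SIZE, so large requests can fail. */
            void *k = kmalloc(4 * 1024 * 1024, GFP_KERNEL);

            /* vmalloc() only needs virtually contiguous pages, so it can
             * usually satisfy much larger requests. */
            void *v = vmalloc(64 * 1024 * 1024);

            pr_info("kmalloc: %p, vmalloc: %p\n", k, v);
            kfree(k);   /* kfree(NULL) is a no-op */
            vfree(v);   /* as is vfree(NULL) */
            return 0;
    }

    static void __exit alloc_demo_exit(void)
    {
    }

    module_init(alloc_demo_init);
    module_exit(alloc_demo_exit);
    MODULE_LICENSE("GPL");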

Related

allocation of memory more than 8 KB for device driver in linux kernel space

I wrote code for a device driver in kernel space, and I would like to allocate more than 8 KB using __get_free_pages(flags, order).
Can you please provide a solution for allocating a large amount of low memory in kernel space? (NOTE: contiguous physical memory is required here.)
Initially I was able to allocate 4 KB; by increasing the order it improved to 8 KB, but I would like to allocate more than that.
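For reference, a minimal sketch of the __get_free_pages() pattern the question describes; the 32 KB size and the GFP_DMA zone choice are illustrative assumptions:

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/errno.h>

    #define BUF_BYTES (32 * 1024)  /* 32 KB = order 3 with 4 KB pages */

    static unsigned long buf;

    static int grab_contiguous(void)
    {
            /* get_order() converts a byte count to a page order; orders up
             * to MAX_ORDER-1 (usually 10, i.e. 4 MB) can succeed, but high
             * orders fail more often as physical memory fragments over
             * time. GFP_DMA keeps the allocation in low memory. */
            buf = __get_free_pages(GFP_KERNEL | GFP_DMA, get_order(BUF_BYTES));
            return buf ? 0 : -ENOMEM;
    }

    static void release_contiguous(void)
    {
            free_pages(buf, get_order(BUF_BYTES));
    }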

Memory allocation under uCOS-III

I'm developing a C library to be used under uCOS-III. The CPU is an ARM Cortex-M4 SAM4C. Within the library I want to use a third-party product X, whose particular name is not relevant here. The source code for X is completely available and compiles without problems.
Inside X, a lot of memory allocations are performed using calloc() and free().
The problem is that plain use of malloc is not advisable for embedded systems because of memory fragmentation. The documentation for uCOS-III explicitly advises against using malloc; instead, OSMemCreate/OSMemGet/OSMemPut shall be used to allocate and free chunks of memory out of a statically allocated memory block.
Question-1:
What is the general advice to get around the "standard implementation" of malloc? I would prefer a kind of malloc, where I have access to a fixed memory pool (e.g. dedicated for a special task)
Question-2:
How should OSMemCreate() be used correctly? I first have to initialize a memory partition with a certain block size. The amount of requested memory is between 4 bytes and about 800 bytes. I can get blocks on request, but only of fixed size. If block-size=4 I cannot allocate 16 bytes, since blocks are not contiguous in memory. If block-size=800 and I need only 4 bytes, most of the block is left unused and I will very soon run out of blocks.
So I don't know how to solve my original problem by use of OSMemCreate...
Can anybody give me an advice how I could proceed?
Many thanks,
Michael
1) Don't link with the standard library version of malloc/free. Instead create your own implementation of malloc/free that serves as a wrapper to OSMemGet/OSMemPut.
2) You can create more than one memory partition with OSMemCreate. Create small, medium, and large partitions that hold block sizes which are tuned for your application to reduce waste.
If you want malloc to get an appropriately sized block from your various memory partitions then you'll have to invent some magic so that free returns the block to the appropriate memory partition. (Perhaps malloc allocates an extra word, stores the pointer to the memory partition in the first word, and then returns the address after the word where the pointer is stored. Then free knows to get the memory partition pointer from the preceding word.)
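A minimal sketch of that wrapper idea against uC/OS-III's OSMemCreate()/OSMemGet()/OSMemPut(); the two partitions, their payload sizes (32 and 800 bytes), and the block counts are illustrative assumptions that must be tuned:

    #include <stddef.h>
    #include <os.h>  /* uC/OS-III */

    /* One pointer-sized header word per block records which partition
     * the block came from. */
    #define SMALL_WORDS ((32u / sizeof(void *)) + 1u)
    #define LARGE_WORDS ((800u / sizeof(void *)) + 1u)

    static OS_MEM small_part, large_part;
    static void  *small_store[64][SMALL_WORDS];
    static void  *large_store[8][LARGE_WORDS];

    void my_heap_init(void)
    {
        OS_ERR err;

        OSMemCreate(&small_part, (CPU_CHAR *)"small", small_store,
                    64u, SMALL_WORDS * sizeof(void *), &err);
        OSMemCreate(&large_part, (CPU_CHAR *)"large", large_store,
                    8u, LARGE_WORDS * sizeof(void *), &err);
    }

    void *my_malloc(size_t n)
    {
        OS_ERR   err;
        OS_MEM  *part;
        void   **blk;

        if (n > 800u)                  /* bigger than the largest partition */
            return NULL;
        part = (n <= 32u) ? &small_part : &large_part;
        blk  = (void **)OSMemGet(part, &err);
        if (err != OS_ERR_NONE)
            return NULL;
        blk[0] = part;                 /* remember the source partition */
        return (void *)&blk[1];        /* payload starts after the header */
    }

    void my_free(void *p)
    {
        OS_ERR err;
        void **blk = (void **)p - 1;

        OSMemPut((OS_MEM *)blk[0], (void *)blk, &err);
    }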
An alternative to using malloc/free is to rewrite that code to use statically allocated variables or call OSMemGet/OSMemPut directly.

Why is the kernel concerned about issuing PHYSICALLY contiguous pages?

When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research and, across a few sources, uncovered the following reasons:
1) it makes better use of the cache and achieves lower average memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! This will help me better understand the kernel...
The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient, both because large pages can be used for it (which take up less space for page table entries and fewer TLB entries) and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that are only logically (virtually) contiguous can be requested using the vmalloc() interface rather than kmalloc().
On 64-bit systems the kernel's mapping can encompass the entirety of physical memory; on 32-bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
Actually, the memory-allocation behavior you describe is common to many OS kernels, and the main reason is the kernel's physical page allocator. Typically, the kernel has one physical page allocator that is used to allocate pages for both kernel space (including pages for DMA) and user space. In kernel space you need contiguous memory, because it's expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, such remapping is completely pointless because the kernel can see the whole address space (on 32-bit systems there's the 4 GB limit on virtual address space, so typically the top 1 GB is dedicated to the kernel and the bottom 3 GB to user space).
The Linux kernel uses the buddy algorithm for page allocation, so allocating a bigger chunk takes fewer iterations than allocating many smaller chunks (smaller chunks are obtained by splitting bigger ones). Moreover, using one allocator for both kernel space and user space allows the kernel to reduce fragmentation. Imagine that you allocate pages for user space one page per iteration: if user space needs N pages, you make N iterations. What happens if the kernel then wants some contiguous memory? How can it build a big enough contiguous chunk if you stole one page from each big chunk and gave them to user space?
[update]
Actually, the kernel allocates contiguous blocks of memory for user space less frequently than you might think. Sure, it allocates them when it builds the ELF image of a file, when it performs readahead as a user process reads a file, for IPC operations (pipes, socket buffers), or when the user passes the MAP_POPULATE flag to the mmap syscall. But typically the kernel uses a "lazy" page-loading scheme: it gives a contiguous range of virtual memory to user space (when the user first calls malloc or calls mmap), but it doesn't fill that range with physical pages. It allocates pages only when a page fault occurs. The same is true when a user process forks: in this case the child process gets a "read-only" address space, and when the child modifies some data, a page fault occurs and the kernel replaces the page in the child's address space with a new one (so that parent and child now have different pages). Typically the kernel allocates only one page in these cases.
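A small userspace sketch of the lazy vs. eager (MAP_POPULATE) behavior described above; the 64 MB size is an arbitrary choice:

    #define _GNU_SOURCE
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL << 20;  /* 64 MB */

        /* Lazy (the default): only the virtual range is reserved;
         * physical pages arrive one page fault at a time. */
        char *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Eager: MAP_POPULATE asks the kernel to fault pages in up front. */
        char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

        if (lazy == MAP_FAILED || eager == MAP_FAILED)
            return 1;

        lazy[0] = 1;  /* first touch of a lazy page triggers a page fault */

        munmap(lazy, len);
        munmap(eager, len);
        return 0;
    }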
Of course there's the big question of memory fragmentation. Kernel space always needs contiguous memory. If the kernel allocated pages for user space from "random" physical locations, it would be much harder to get a big chunk of contiguous memory in the kernel after some time (for example, after a week of system uptime). Memory would be too fragmented in this case.
To mitigate this, the kernel uses a "readahead" scheme: when a page fault occurs in the address space of some process, the kernel allocates and maps more than one page (because there's a good chance the process will read/write data from the next page). And of course it uses a physically contiguous block of memory (if possible) in this case, just to reduce potential fragmentation.
A couple of reasons that I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so. Some older DMA controllers even require that memory to be located at low physical addresses.
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory. This reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4MB page, you're going to need 4MB of contiguous physical memory to back it.
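For example, a userspace sketch of requesting a huge-page-backed mapping (this assumes the administrator has reserved hugetlb pages beforehand, e.g. via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    void *map_huge(void)
    {
        /* A 4 MB mapping backed by huge pages: the kernel must find that
         * much physically contiguous memory to create it. Fails if no
         * huge pages are reserved or not enough contiguous memory is
         * left. */
        void *p = mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }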
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.
Whether to request contiguous or non-contiguous memory from the kernel depends on your application.
Example of contiguous memory allocation: if you need to perform a DMA operation, you will request contiguous memory through a kmalloc() call, because a DMA operation requires memory that is physically contiguous: with DMA you provide only the starting address of the memory chunk, and the other device reads from or writes to that location.
Some operations do not require contiguous memory, so you can request a memory chunk through vmalloc(), which returns a pointer to memory that is not physically contiguous.
So it depends entirely on the application that is requesting the memory.
Please remember that it is good practice to request contiguous memory only when it is actually needed, as the kernel tries its best to allocate memory that is physically contiguous. kmalloc() and vmalloc() have their limits as well.
1) Placing things we are going to read a lot physically close together takes advantage of spatial locality: things we need are more likely to be cached.
2) Not sure about this one.
3) I believe this means that if pages are not contiguous, the TLB has to do more work to find out where they all are. If they are contiguous, we can express all the pages for a process as PAGE_START + PAGE_OFFSET. If they aren't, we need to store a separate index for all of the pages of a given process. Because the TLB has a finite size and we need to access more data, we will be swapping entries in and out a lot more.
The kernel does not strictly need physically contiguous pages; it wants them for efficiency and stability.
A monolithic kernel tends to have one page table for kernel space, shared among processes,
and it does not want page faults on kernel space, which would make the kernel design too complex.
So the usual implementation on 32-bit architectures is a 3G/1G split of the 4G address space.
For the 1G kernel space, normal mappings of code and data should not generate recursive page faults, as that would be too complex to manage:
you would need to find empty page frames, create mappings in the MMU, and handle TLB flushes for the new mappings on every kernel-side page fault,
and the kernel is already busy handling user-side page faults.
Furthermore, a 1:1 linear mapping can use far fewer page table entries because it can use a bigger page unit (>4 KB).
Fewer entries lead to fewer TLB misses.
So the buddy allocator on the kernel's linear address space always provides physically contiguous page frames,
even though most code doesn't need contiguous frames.
But many device drivers that do need contiguous page frames already assume that buffers allocated through the general kernel allocator are physically contiguous.

ZONE_NORMAL and ZONE_HIGHMEM on 32- and 64-bit kernels

I'm trying to make Linux memory management a little clearer for tuning and performance purposes.
While reading the very interesting Redbook "Linux Performance and Tuning Guidelines" on the IBM website, I came across something I don't fully understand.
On 32-bit architectures such as the IA-32, the Linux kernel can directly address only the first gigabyte of physical memory (896 MB when considering the reserved range). Memory above the so-called ZONE_NORMAL must be mapped into the lower 1 GB. This mapping is completely transparent to applications, but allocating a memory page in ZONE_HIGHMEM causes a small performance degradation.
Why does memory above 896 MB have to be mapped into the lower 1 GB?
Why is there a performance impact when allocating a memory page in ZONE_HIGHMEM?
What is ZONE_HIGHMEM used for, then?
Why can a kernel that is able to recognize up to 4 GB (CONFIG_HIGHMEM=y) only use the first gigabyte directly?
Thanks in advance
When a user process traps into the kernel, the page tables are not changed. This means that one linear address space must be able to cover both the memory addresses available to the user process and the memory addresses available to the kernel.
On IA-32, which allows a 4GB linear address space, usually the first 3GB of the linear address space are allocated to the user process, and the last 1GB of the linear address space is allocated to the kernel.
The kernel must use its 1GB range of addresses to be able to address any part of physical memory it needs to. Memory above 896MB is not "mapped into the low 1GB" - what happens is that physical memory below 896MB is assigned a permanent linear address in the kernel's part of the linear address space, whereas memory above that limit must be assigned a temporary mapping in the remaining part of the linear address space.
There is no impact on performance when mapping a ZONE_HIGHMEM page into a userspace process - for a userspace process, all physical memory pages are equal. The impact on performance exists when the kernel needs to access a non-user page in ZONE_HIGHMEM - to do so, it must map it into the linear address space if it is not already mapped.
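A sketch of what that temporary mapping looks like from kernel code on a 32-bit kernel of this era (kmap()/kunmap(); newer kernels prefer kmap_local_page()):

    #include <linux/gfp.h>
    #include <linux/highmem.h>
    #include <linux/string.h>
    #include <linux/errno.h>

    static int touch_highmem_page(void)
    {
            /* May come from ZONE_HIGHMEM, which has no permanent
             * kernel mapping. */
            struct page *pg = alloc_page(GFP_HIGHUSER);
            void *vaddr;

            if (!pg)
                    return -ENOMEM;
            vaddr = kmap(pg);             /* borrow a temporary mapping */
            memset(vaddr, 0, PAGE_SIZE);
            kunmap(pg);                   /* give the mapping slot back */
            __free_page(pg);
            return 0;
    }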

How to create a 100 MB buffer

I am testing the throughput of an interface on Linux. I am using DMA to do the data transfer. DMA needs contiguous memory, but kmalloc is unable to allocate more than 1 MB. Is there any other way to create a big buffer of up to 100 MB?
I thought kmalloc couldn't allocate over 128 kB; how did you get it to allocate 1 MB?
Anyway, assuming you're working on a freshly booted system, you can reserve up to 2048 contiguous pages. Pages are generally 4 kB, so this makes 8 MB:
__get_free_pages(GFP_DMA, get_order(2048 * PAGE_SIZE));
However, if you need more memory, you should do the allocation at boot-time.
If you are writing a driver, this can be achieved with the alloc_bootmem_* functions.
If you are writing a module, you have to pass a mem= argument to your kernel and later use ioremap().
For example, if you have 2 GB, you can pass mem=1G to forbid the kernel from using the upper 1 GB, and later call ioremap(0x40000000, 0x40000000) to get access to that upper 1 GB, just for you.
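A sketch of the module side of that approach, assuming 2 GB of RAM and a boot with mem=1G so that physical addresses from 0x40000000 upward are left untouched by the kernel:

    #include <linux/errno.h>
    #include <linux/io.h>

    #define HIDDEN_BASE 0x40000000UL  /* start of the RAM hidden by mem=1G */
    #define HIDDEN_SIZE 0x40000000UL  /* the hidden 1 GB */

    static void __iomem *big_buf;

    static int map_hidden_ram(void)
    {
            /* Map the reserved physical range into kernel virtual space.
             * On a 32-bit kernel the vmalloc area is far smaller than
             * 1 GB, so in practice you would map it in smaller pieces. */
            big_buf = ioremap(HIDDEN_BASE, HIDDEN_SIZE);
            return big_buf ? 0 : -ENOMEM;
    }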
But really, you should just split your huge buffer into many small ones; it'll be much easier and much more like real-life applications.
