Is there any way to allocate an exact physical memory address with caching in Ubuntu? - memory-management

I want to allocate a fixed-size buffer at a physical memory location that would be exactly the same across different runs.
void * getBuffer(size_t len, uintptr_t paddr);
So this function getBuffer(len, paddr) should allocate a buffer of len bytes at the exact physical address paddr.
So far I have seen "/dev/mem"-based memory allocation used for DMA, but according to this, this and others it does not follow the normal cache access pattern. (Please correct me here if I am wrong.)
But I want to make sure that memory accesses through the buffer returned by getBuffer() follow the regular CPU cache path, i.e. the access goes through the L1, L2 and L3 caches and finally reads/writes main memory.
Please let me know if this is possible. If not, is there any other way of accessing a fixed physical address? Feel free to share!
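For reference, a minimal sketch of what the /dev/mem approach looks like (not a confirmed solution; it assumes root privileges, that the kernel permits access to that physical range, and that paddr is page aligned). Whether the mapping is cached is exactly the open question: on some kernels/architectures, opening /dev/mem with O_SYNC yields an uncached mapping, while without it RAM is usually mapped cacheable, but this is architecture and kernel dependent.

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Sketch of getBuffer() built on /dev/mem.  Omitting O_SYNC is what
     * (on some kernels/architectures) leaves the mapping cacheable. */
    void *getBuffer(size_t len, uintptr_t paddr)
    {
        int fd = open("/dev/mem", O_RDWR);
        if (fd < 0)
            return NULL;

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, (off_t)paddr);
        close(fd);                       /* the mapping survives the close */
        return (p == MAP_FAILED) ? NULL : p;
    }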

Related

Virtual Address to Physical address translation in the light of the cache memory

I do understand how a virtual address is translated to a physical address to access main memory. I also understand how the cache memory works.
But my problem is in putting the two concepts together and understanding the big picture of how a process accesses memory and what happens if we have a cache miss. So I have this drawing that will help me ask the following questions:
[image: diagram of the memory access path] (assume a one-level cache)
1- Does the process access the cache with the exact same physical address that represents the location of the byte in main memory?
2- Is the TLB actually in the first level of cache, or is it a separate memory inside the CPU chip that is dedicated to the translation purpose?
3- When there is a cache miss, I need to fetch a whole block and allocate it in the cache, but main memory is organized in frames (pages), not blocks. So is a page of the process itself divided into cache blocks that can be brought into the cache on a miss?
4- Let's assume there is a TLB miss. Does that mean that I need to go all the way to main memory and do the page walk there, or does the page walk happen in the cache?
5- Does a TLB miss guarantee that there will be a cache miss?
6- If you have any reading material that explains the big picture I am trying to understand, I would really appreciate you sharing it with me.
Thanks, and feel free to answer any single question I have asked.
Yes. The cache is not memory that can be addressed separately. Cache mapping will translate a physical address into an address for the cache, but this mapping is not something a process usually controls. For some CPU architectures it is completely controlled by the hardware (e.g. Intel x86). For others the operating system would be expected to program the mapping.
The TLB in the diagram you gave is for virtual-to-physical address mapping. It is probably not for the cache. Again, on some architectures the TLBs are programmed by software, whereas on others they are controlled by the hardware.
Page size and cache block size do not have to be the same, as one relates to virtual memory and the other to physical memory. When a process accesses a virtual address, that address is translated to a physical address using the TLB, taking the page size into account. Once that's done, the size of a page is of no concern: the access is for a byte/word at a physical address. If this causes a cache miss, then the block that is read into the cache is the cache-block-sized region of physical memory that covers the address being accessed.
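To make the page-versus-block point concrete, here is a toy decomposition of one physical address under assumed example parameters (4 KB pages, 64-byte cache lines, a 128-line direct-mapped cache; none of these numbers come from the question):

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed example parameters: 4 KB pages, 64 B lines, 128-line cache. */
    #define PAGE_SHIFT 12
    #define LINE_SHIFT 6
    #define INDEX_BITS 7            /* 128 lines */

    int main(void)
    {
        uint64_t paddr = 0x12345678;

        /* Paging view: which frame, and where inside it. */
        uint64_t frame  = paddr >> PAGE_SHIFT;
        uint64_t in_pg  = paddr & ((1u << PAGE_SHIFT) - 1);

        /* Cache view of the very same address: tag / index / offset.
         * A 4 KB frame spans 64 different 64-byte cache blocks. */
        uint64_t offset = paddr & ((1u << LINE_SHIFT) - 1);
        uint64_t index  = (paddr >> LINE_SHIFT) & ((1u << INDEX_BITS) - 1);
        uint64_t tag    = paddr >> (LINE_SHIFT + INDEX_BITS);

        printf("frame=%llu page-offset=%llu\n",
               (unsigned long long)frame, (unsigned long long)in_pg);
        printf("tag=%llu index=%llu line-offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)index,
               (unsigned long long)offset);
        return 0;
    }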
A TLB miss will require a page translation, which reads other memory. This process can occur in hardware on some CPUs (such as Intel x86/x64) or may need to be handled in software. Once the page translation has been completed, the TLB will be reloaded with the new translation.
A TLB miss does not imply a cache miss. A TLB miss just means the virtual-to-physical address mapping was not known and a page address translation had to occur. A cache miss means the contents of physical memory could not be provided quickly.
To recap:
the TLB is there to convert virtual addresses to physical addresses quickly. It exists to cache the virtual-to-physical memory mapping. It does not have anything to do with the contents of physical memory.
the cache is to allow faster access to memory. It is only there to provide the content of physical memory faster.
Keep in mind that the term cache can be used for lots of purposes (e.g. note the usage of "cache" when describing the TLB). TLB is a bit more specific and usually implies virtual memory translation, though that's not universal. For example, some DMA controllers have a TLB too, but that TLB is not necessarily used to translate virtual to physical addresses; rather, it converts block addresses to physical addresses.

DMA cache coherent contiguous allocation

I was looking at allocating a buffer for a DMA transaction. I read that there are two ways of doing this - coherent mappings or streaming mappings.
Now, we want cache coherent buffers. However, we are not doing any scatter/gather so we want the benefits of the dma_map_single call. We have set aside some contiguous memory in the bootargs so we always have enough contiguous memory available.
So I was wondering: can we call dma_alloc_coherent and then dma_map_single after it, with the virtual address that dma_alloc_coherent returns? The physical address returned by dma_map_single would then be used in place of the DMA handle that dma_alloc_coherent returned.
Does this make sense? Or is it superfluous/incorrect to do it this way?
You can't use both. But you're also trying to solve a nonexistent problem. Just use dma_alloc_coherent(), which gives you a contiguous DMA memory buffer with both virtual and physical addresses. DMA to the physical address, access it from the CPU with the virtual address. What's the issue?
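A minimal sketch of the coherent-only path (driver context assumed; dev would be the struct device from your probe routine, and BUF_SIZE is a made-up constant, neither of which appears in the question):

    #include <linux/device.h>
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    #define BUF_SIZE (64 * 1024)      /* assumed buffer size */

    static void *buf_cpu;             /* CPU-visible address */
    static dma_addr_t buf_dma;        /* device-visible (bus/DMA) address */

    static int alloc_dma_buffer(struct device *dev)
    {
        buf_cpu = dma_alloc_coherent(dev, BUF_SIZE, &buf_dma, GFP_KERNEL);
        if (!buf_cpu)
            return -ENOMEM;
        /* Program the device with buf_dma; read/write the data via buf_cpu.
         * No dma_map_single()/dma_sync_*() calls are needed for this buffer. */
        return 0;
    }

    static void free_dma_buffer(struct device *dev)
    {
        dma_free_coherent(dev, BUF_SIZE, buf_cpu, buf_dma);
    }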

Requesting memory more than available

Suppose there is 50 MB available to allocate, and there is this for loop which allocates memory in each iteration. What happens when the following for loop runs?
for (int i = 0; i < 20; i++)
{
    int *p = malloc(5MB);
}
I was asked this question in an interview. Can someone guide me on this and direct me to the necessary topics which I must learn in order to understand such situations.
If this is a system utilising virtual memory then the situation is obviously more complex than malloc simply failing and returning null pointers. In this case the malloc call will result in allocation of pages of memory in the virtual address space. When this memory is accessed, a page fault will result, control will be given to the OS memory manager, and it will map the virtual memory pages to pages of physical memory. When the available physical memory is full, the memory manager will generally handle further page faults by writing data which is currently in physical memory out to disk backing (or simply discarding this data if it is already backed by a disk file), and then remapping this now-available physical memory to the virtual memory page which originally caused the page fault. Any attempt to access virtual address pages which have previously been written out to disk backing will result in a similar process.
Wikipedia contains a reasonably basic overview of this process (http://en.wikipedia.org/wiki/Paging), including some implementation details for different OSs. More detailed information is available from a number of other sources, e.g. the Intel architectures software developer manuals (http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html).
According to the documentation for malloc (e.g. http://www.cplusplus.com/reference/cstdlib/malloc/), you will get a pointer to a 5 MB memory block (noting that the argument to malloc should be the number of bytes required, not megabytes) the first 10 times (maybe 9 times if there is overhead associated with the malloc'ed space which is taken out of the 50 MB available). After that, further 5 MB chunks of memory will not be available, and malloc will fail, returning a null pointer.
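As a sketch of what the answers above describe (size spelled out in bytes, return value checked; the memset matters because on an overcommitting system malloc can succeed without any physical pages being committed yet):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK (5 * 1024 * 1024)          /* 5 MB, in bytes this time */

    int main(void)
    {
        for (int i = 0; i < 20; i++) {
            char *p = malloc(CHUNK);
            if (p == NULL) {                 /* allocation refused */
                printf("malloc failed at iteration %d\n", i);
                break;
            }
            memset(p, 0, CHUNK);             /* touch the pages so they must be backed */
            /* intentionally never freed, as in the interview snippet */
        }
        return 0;
    }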

memory allocation vs. swapping (under Windows)

Sorry for my rather general question, but I could not find a definite answer to it:
Given that I have free swap space left and I allocate memory in reasonable chunks (~1 MB), can memory allocation still fail for any reason?
The smartass answer would be "yes, memory allocation can fail for any reason". That may not be what you are looking for.
Generally, whether your system has free memory left is not related to whether allocations succeed. Rather, the question is whether your process address space has free virtual address space.
The allocator (malloc, operator new, ...) first looks whether there is free address space in the current process that is already mapped, that is, where the kernel is aware that the addresses should be usable. If there is, that address space is reserved in the allocator and returned.
Otherwise, the kernel is asked to map new address space to the process. This may fail, but generally doesn't, as mapping does not imply using physical memory yet -- it is just a promise that, should someone try to access this address, the kernel will try to find physical memory and set up the MMU tables so the virtual->physical translation finds it.
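On POSIX systems the "ask the kernel to map new address space" step boils down to an anonymous mmap (the Windows counterpart is VirtualAlloc); a minimal sketch of that lazy behaviour:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1 << 20;                 /* 1 MB of address space */

        /* Reserve address space; no physical pages are assigned yet. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        ((char *)p)[0] = 42;   /* first touch: only now is a physical page needed */

        munmap(p, len);
        return 0;
    }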
When the system is out of memory, i.e. there is no physical memory left, the process is suspended and the kernel attempts to free physical memory by moving other processes' memory to disk. The application does not notice this, except that executing a single assembler instruction apparently took a long time.
Memory allocations in the process fail if there is no mapped free region large enough and the kernel refuses to establish a mapping. For example, not all virtual addresses are usable, as most operating systems map the kernel at some address (typically 0x80000000, 0xc0000000, 0xe0000000 or something similar on 32-bit architectures), so there is a per-process limit that may be lower than the system limit (for example, a 32-bit process on Windows can only allocate 2 GB, even if the system is 64-bit). File mappings (such as the program itself and DLLs) further reduce the available space.
A very general and theoretical answer would be no, it cannot. One of the reasons it could possibly fail, under very peculiar circumstances, is some weird fragmentation of your available/allocatable memory. I wonder whether you're trying to get a (probably very minor) performance boost (skipping the if pointer == NULL check, that kind of thing) or whether you're just wondering and want to discuss it, in which case you should probably use chat.
Yes, memory allocation often fails when you run out of address space in a 32-bit application (which can be 2, 3 or 4 GB depending on the OS version and settings). This would typically be due to a memory leak. It can also fail if your OS runs out of space in your swap file.

Why is the kernel concerned about issuing PHYSICALLY contiguous pages?

When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research and, across a few sources, uncovered the following reasons:
1) It makes better use of the cache and achieves lower average memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! It will help me to better understand the kernel...
The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space for page table entries and fewer TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that only need to be virtually (logically) contiguous can be requested using the vmalloc() interface rather than kmalloc().
On 64-bit systems the kernel's mapping can encompass the entirety of physical memory; on 32-bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
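A sketch of the two interfaces side by side (kernel-module context assumed; the sizes are arbitrary):

    #include <linux/errno.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* kmalloc() carves its result out of the kernel's big linear mapping, so it
     * is physically contiguous; vmalloc() builds a virtually contiguous range
     * out of whatever page frames are handy by editing the kernel page tables. */
    static int demo_allocs(void)
    {
        void *phys_contig = kmalloc(64 * 1024, GFP_KERNEL);   /* physically contiguous */
        void *virt_contig = vmalloc(16 * 1024 * 1024);        /* only virtually contiguous */

        if (!phys_contig || !virt_contig) {
            kfree(phys_contig);       /* both are safe to call on NULL */
            vfree(virt_contig);
            return -ENOMEM;
        }

        kfree(phys_contig);
        vfree(virt_contig);
        return 0;
    }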
Actually the behavior of memory allocation you describe is common to many OS kernels, and the main reason is the kernel's physical page allocator. Typically, the kernel has one physical page allocator that is used to allocate pages for both kernel space (including pages for DMA) and user space. In kernel space you need contiguous memory, because it's expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, remapping would be completely pointless because the kernel can see the whole address space (on 32-bit systems there's a 4 GB limit on virtual address space, so typically the top 1 GB is dedicated to the kernel and the bottom 3 GB to user space).
The Linux kernel uses the buddy algorithm for page allocation, so allocating a bigger chunk takes fewer iterations than allocating many smaller chunks (smaller chunks are obtained by splitting bigger ones). Moreover, using one allocator for both kernel space and user space lets the kernel reduce fragmentation. Imagine that you allocated pages for user space one page per iteration: if user space needs N pages, you make N iterations. What happens if the kernel then wants some contiguous memory? How could it build a big enough contiguous chunk if you stole one page from each big chunk and gave them to user space?
[update]
Actually, the kernel allocates contiguous blocks of memory for user space less often than you might think. Sure, it allocates them when it builds the ELF image of a file, when it performs readahead as a user process reads a file, for IPC operations (pipes, socket buffers), or when the user passes the MAP_POPULATE flag to the mmap syscall. But typically the kernel uses a "lazy" page loading scheme. It hands a contiguous range of virtual memory to user space (when the user first calls malloc or mmap), but it doesn't fill the range with physical pages; it allocates pages only when a page fault occurs. The same is true when a user process forks. In that case the child process gets a "read-only" address space; when the child modifies some data, a page fault occurs and the kernel replaces the page in the child's address space with a new one (so that parent and child now have different pages). Typically the kernel allocates only one page in these cases.
Of course there's the big question of memory fragmentation. Kernel space always needs contiguous memory, and if the kernel allocated pages for user space from "random" physical locations, it would become much harder to get a big chunk of contiguous memory in the kernel after some time (for example after a week of system uptime). Memory would be too fragmented in that case.
To mitigate this, the kernel uses a "readahead" scheme: when a page fault occurs in the address space of some process, the kernel allocates and maps more than one page (because there's a good chance the process will read/write data from the next page), and of course it uses a physically contiguous block of memory for that (if possible), just to reduce potential fragmentation.
A couple of reasons that I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so. Some older DMA controllers even require that memory to be located at low physical addresses.
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory, which reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4 MB page, you're going to need 4 MB of contiguous physical memory to back it (a user-space sketch follows this list).
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.
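The large-page point can be seen from user space too; a sketch using Linux's MAP_HUGETLB (it fails unless huge pages have been reserved, e.g. via /proc/sys/vm/nr_hugepages, precisely because each huge page needs that much physically contiguous memory behind it):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;   /* one 2 MB huge page (x86-64 default size) */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* typically ENOMEM if no huge pages reserved */
            return 1;
        }

        munmap(p, len);
        return 0;
    }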
Whether to request contiguous or non-contiguous memory from the kernel depends on your application.
Example of contiguous memory allocation: if you need to perform a DMA operation, you will request contiguous memory through a kmalloc() call, because DMA requires memory that is also physically contiguous: you provide only the starting address of the memory chunk and the other device reads from or writes to that location.
Some operations do not require contiguous memory, so you can request a memory chunk through vmalloc(), which returns a pointer to memory that is not physically contiguous.
So it entirely depends on the application that is requesting the memory.
Please remember that it is good practice to request physically contiguous memory only when it is really needed, as the kernel tries its best to allocate memory which is physically contiguous. kmalloc() and vmalloc() have their limits as well.
Placing things we are going to be reading a lot physically close together takes advantage of spatial locality; things we need are more likely to be cached.
Not sure about this one
I believe this means that if pages are not contiguous, the TLB has to do more work to find out where they all are. If they are contiguous, we can express all the pages for a process as PAGES_START + PAGE_OFFSET. If they aren't, we need to store a separate index for all of the pages of a given process. Because the TLB has a finite size and we need to access more data, this means we will be swapping entries in and out a lot more.
The kernel does not strictly need physically contiguous pages; what it needs is efficiency and stability.
A monolithic kernel tends to have one page table for kernel space, shared among all processes, and it does not want page faults on kernel space, because that would make the kernel design too complex. That is why the usual implementation on 32-bit architectures is a 3G/1G split of the 4G address space.
For the 1G kernel space, normal mappings of code and data should not generate recursive page faults; that would be too complex to manage: you would need to find empty page frames, create the mapping in the MMU, and handle TLB flushes for the new mappings on every kernel-side page fault, while the kernel is already busy handling user-side page faults.
Furthermore, a 1:1 linear mapping can use far fewer page table entries because it can use a bigger page size (>4 KB), and fewer entries lead to fewer TLB misses.
So the buddy allocator on the kernel's linear address space always provides physically contiguous page frames, even though most code doesn't need contiguous frames; but many device drivers that do need contiguous page frames already rely on buffers allocated through the general kernel allocator being physically contiguous.
