I am reading Linux Kernel Development by Robert Love. I don't understand this paragraph about the bio structure:
The basic container for block I/O within the kernel is the bio structure, which is defined in <linux/bio.h>. This structure represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. By
allowing the buffers to be described in chunks, the bio structure provides the capability for the kernel to perform block I/O operations of even a single buffer from multiple locations in memory. Vector I/O such as this is called scatter-gather I/O.
What exactly does "in flight (active)" mean?
"As a list of segments" -- are we talking about this segmentation?
What does "By allowing the buffers ... in memory" mean?
Block devices are devices that transfer data in fixed-size chunks (e.g. 512 or 1024 bytes) during an I/O transaction. struct bio is the kernel-space container for block I/O operations, and it is commonly used in block device driver development.
Q1) What exactly does "in flight (active)" mean?
Block devices are usually backed by a file system meant for storing files. Whenever a user-space application initiates a file I/O operation (read, write), the kernel in turn initiates a sequence of block I/O operations through the file-system layer. A struct bio describes a block I/O transaction (initiated on behalf of the user application) that has been submitted but still has to be processed. That is what is meant here by in flight (active).
"Q2) As a list of segments" -- are we talking about this segmentation?
The kernel needs memory buffers to hold data going to or coming from the block device.
In the kernel there are two ways in which such memory can be allocated:
Virtually contiguous and physically contiguous (allocated with kmalloc(): good performance, but limited in size)
Virtually contiguous but physically non-contiguous (allocated with vmalloc(): for large memory requirements)
Here a segment means the first kind, i.e. a physically contiguous region of memory used for a block I/O transfer, and a list of segments means a set of such regions. Note that the individual segments in the list need not be contiguous with one another.
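To make the two flavours concrete, here is a minimal kernel-space sketch (the function and variable names are made up for illustration, and the sizes are arbitrary):

    #include <linux/slab.h>      /* kmalloc, kfree */
    #include <linux/vmalloc.h>   /* vmalloc, vfree */

    static void *small_buf;   /* physically contiguous */
    static void *large_buf;   /* virtually contiguous only */

    static int alloc_example(void)
    {
        /* Physically and virtually contiguous: usable as a block I/O segment. */
        small_buf = kmalloc(4096, GFP_KERNEL);
        if (!small_buf)
            return -ENOMEM;

        /* Virtually contiguous, but the underlying pages may be scattered. */
        large_buf = vmalloc(4 * 1024 * 1024);
        if (!large_buf) {
            kfree(small_buf);
            return -ENOMEM;
        }
        return 0;
    }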
Q3) What does "By allowing the buffers ... in memory" mean?
Scatter-gather is a feature that allows data to be transferred between the device and multiple non-contiguous memory locations in a single operation (one read or write transaction). Here struct bio keeps a record of the multiple segments to be processed: each segment is a contiguous memory region, while the segments themselves may be non-contiguous with one another. This is how struct bio gives the kernel the ability to perform scatter-gather I/O.
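For reference, this is roughly what those structures look like; a trimmed-down view with the 2.6-era field names used in the book (current kernels have moved the position and size fields into bio->bi_iter, so treat this as illustrative, not authoritative):

    /* Trimmed-down view of the structures from <linux/bio.h>. */
    struct bio_vec {
        struct page   *bv_page;    /* the page holding this segment     */
        unsigned int   bv_len;     /* length of the segment in bytes    */
        unsigned int   bv_offset;  /* offset of the segment in the page */
    };

    struct bio {
        sector_t        bi_sector;   /* first sector on the device          */
        unsigned int    bi_size;     /* total bytes to transfer             */
        unsigned short  bi_vcnt;     /* number of bio_vecs below            */
        struct bio_vec *bi_io_vec;   /* the list of segments: each entry is
                                        contiguous in memory, but the entries
                                        need not be contiguous with each other */
        /* ... many other fields omitted ... */
    };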
"In flight" means an operation that has been requested, but hasn't been initiated yet.
"Segment" here means a range of memory to be read or written, a contiguous
piece of the data to be transferred as part of the operation.
"Scatter/gather I/O" is meant by scatter operations that take a contiguous range of data on disk and distributes pieces of it into memory, gather takes separate ranges of data in memory and writes them contiguously to disk. (Replace "disk" by some suitable device in the preceding.) Some I/O machinery is able to do this in one operation (and this is getting more common).
1) "In flight" means "in progress"
2) No
3) Not quite sure :)
This is my first post. I want to ask how virtual memory is related to paging and segmentation. I have been searching the internet for a few days, but I still can't manage to put the information into the right order. Here is what I know so far:
We can talk about several kinds of addresses (we could say they are levels of memory abstraction):
physical level (the CPU talking to the memory controller: "hey, give me the contents of address 0xFFEABCD"). These addresses are addresses of cells in RAM, so cell 0xABCD has physical address 0xABCD. The memory controller can only use physical addresses, so any other kind of address must be translated to a physical one.
logical level. This is an abstraction over physical addresses. Here, when processes ask for memory (assume a successful allocation), they are given addresses that have no direct relation to cells in RAM. We could say these addresses come from a different pool (world?) than physical addresses. As I said before, the memory controller only understands physical addresses, so to use logical addresses we need to convert them to physical addresses. There are two ways the OS can provide logical addresses:
paging - physical memory (RAM) is divided into fixed-length blocks (called frames), and logical memory (that other world) is divided into blocks of the same length (called pages). The OS keeps in RAM a data structure called the page table. It is an associative array (map) whose primary purpose is to translate logical addresses into physical addresses (a toy translation sketch follows this list). Paging has the following effect: the memory a process has allocated in RAM (i.e. the frames belonging to the program) need not be contiguous (there may be holes between the frames).
segmentation - the program is divided into parts called segments. Segment sizes are not fixed, so different segments may have different sizes. The program is divided into a few segments and each segment gets its own place in RAM (physical memory). So one segment (call it segmentA) and another (call it segmentB) may not be near each other; in other words, segmentA does not have to have segmentB as a neighbour.
internal fragmentation - when memory belonging to a process isn't 100% used. If a process wants 2 bytes, the OS has to allocate a page (or pages) whose total size is greater than or equal to the amount requested. A typical page size is 4 KB, and the page is the unit in which the OS hands memory to a process, so it can't give less than 4 KB. If we use only 2 bytes, 4 KB - 2 B = 4094 bytes are wasted (the memory is associated with our process, so other processes can't use it; only we can, but we only need 2 B).
external fragmentation - when allocated blocks of memory sit near one another but leave small holes between them. The holes are free, so other programs could use them, but that is unlikely because they are so small; with high probability those holes are wasted. More holes means more wasted memory.
Paging may cause internal fragmentation; segmentation may cause external fragmentation.
virtual level - the addresses used with virtual memory. This is an extension of the logical level. Now a program doesn't even need to have all of its allocated pages in RAM to start executing. It can be implemented with the following techniques:
paged segmentation - a method in which segments are divided into pages.
segmented paging - a less used method, but also possible.
Combining them takes the positive aspects of both solutions.
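To make the paging translation above concrete, here is a toy, purely illustrative sketch (single-level table, made-up names and contents; real page tables are multi-level structures walked by the MMU):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE   4096u
    #define PAGE_SHIFT  12u
    #define NUM_PAGES   8u

    /* Toy single-level "page table": logical page number -> physical frame number. */
    static const uint32_t page_table[NUM_PAGES] = { 7, 3, 0, 5, 2, 6, 1, 4 };

    static uint32_t translate(uint32_t logical_addr)
    {
        uint32_t page   = logical_addr >> PAGE_SHIFT;      /* which page?        */
        uint32_t offset = logical_addr & (PAGE_SIZE - 1);  /* where in the page? */
        return (page_table[page] << PAGE_SHIFT) | offset;  /* frame + offset     */
    }

    int main(void)
    {
        /* Logical address 0x1234 lives in page 1, which this toy table maps
         * to frame 3, so the physical address is 0x3234. */
        printf("0x%x -> 0x%x\n", 0x1234u, translate(0x1234u));
        return 0;
    }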
What I have read about the pros and cons of virtual memory:
PROS:
processes have their own address spaces, which means that if we have two processes A and B and both of them have a pointer to, say, address 17, the pointer in process A will refer to a different frame than the pointer in process B. This results in greater process isolation: processes are protected from each other (one process can't touch another process's memory, unless it is shared memory, because no such entry exists in its mapping), and the OS is better protected from processes.
you can have more memory than your physical first-level memory (RAM) provides, thanks to swapping to secondary storage.
better use of memory due to:
swapping unused parts of programs to secondary memory.
making page sharing possible, which also enables "copy on write".
improved multiprogramming capability (when unneeded parts of programs are swapped out to secondary memory, they free space in RAM that can be used for new processes).
improved CPU utilisation (if you can have more processes loaded into memory, there is a higher probability that some process currently needs the CPU rather than I/O, so the CPU can be utilised better).
CONS:
virtual memory has its overhead, because we need to access memory twice (though much of this can be recovered using the TLB).
it makes the memory-management part of the OS more complicated.
So here we come to the parts I don't really understand:
Why do some sources describe logical addresses and virtual addresses as synonyms? Am I getting something wrong?
Does virtual memory really provide protection to processes? I mean, segmentation for example also checked whether a process accesses memory that isn't its own (resulting in a segfault if it does), and paging has a protection bit in the page table, so doesn't the protection come simply from extending the abstraction of logical addresses? If VM (Virtual Memory) brings extended protection features, what are they and how do they work? In other words: does creating a separate address space for each process bring extended memory protection? If so, what can't be achieved with paging without VM?
How exactly does paged segmentation differ from segmented paging? I know the difference lies in how an address is constructed (a page number, a segment number, that kind of thing), but I suppose that alone isn't enough to justify two strategies. I read that segmented paging is less flexible, and that's why it is rarely used. But why is it less flexible? Is the reason that a program can only have a few segments rather than many pages? If so, paging indeed allows better granularity.
If VM makes a separate address space for each process, does that mean paging without VM uses logical addresses from "one pool" (is every logical address globally unique in that case)?
Any help on that topic would be appreciated.
Edit: #1
OK. I finally understood that paging without demand paging is also virtual memory; some clarification was all I needed to understand the topic. Below is a link to an image I made to visualize the differences. Thanks for the help.
differences between paging, demand paging and swapping
Why do some sources describe logical addresses and virtual addresses as synonyms? Am I getting something wrong?
Many sources conflate logical and virtual memory translation. In ye olde days, logical address translation never took place without virtual address translation so processor documentation referred to them as the same.
Now we have large memory systems that use logical memory translation without virtual memory.
Does virtual memory really provide protection to processes?
It is the logical memory translation that implements page protections.
How exactly does paged segmentation differ from segmented paging?
You can really ignore segments. No rationally designed processor architecture introduced after 1970 uses segments, and they are finally dying out.
If VM makes a separate address space for each process, does that mean paging without VM uses logical addresses from "one pool"?
It is logical memory that creates the separate address space for each process. Paging is virtual memory. You cannot have one without the other.
My group (a project called Isis2) is experimenting with RDMA. We're puzzled by the lack of documentation for the atomicity guarantees of one-sided RDMA reads. I've spent the past hour and a half hunting for any kind of information at all on this to no avail. This includes close reading of the blog at rdmamojo.com, famous for having answers to every RDMA question...
In the case we are focused on, we want to have writers doing atomic writes for objects that will always fit within a single cache line. Say this happens on machine A. Then we plan to have a one-sided atomic RDMA reader on machine B, who might read chunks of memory from A, spanning many of these objects (but again, no object would ever be written non-atomically, and all will fit within some single cache line). So B reads X, Y and Z, and each of those objects lives in one cache line on A, and was written with atomic writes.
Thus the atomic writes will be local, but the RDMA reads will arrive from remote machines and are done with no local CPU involvement.
Are our one-sided reads "semantically equivalent" to atomic local reads despite being initiated on the remote machine? (I suspect so: otherwise, one-sided RDMA reads would be useless for data that is ever modified...). And where are the "rules" documented?
Ok, meanwhile I seem to have found the correct answer, and I believe that Roland's response is not quite right -- partly right but not entirely.
In http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf, which is the Intel architecture manual (I'll need to check again for AMD...), I found this: "Atomic memory operation in Intel 64 and IA-32 architecture is guaranteed only for a subset of memory operand sizes and alignment scenarios. The list of guaranteed atomic operations are described in Section 8.1.1 of IA-32 Intel® Architecture Software Developer's Manual, Volumes 3A."
Then in that section, which is entitled MULTIPLE-PROCESSOR MANAGEMENT, one finds a lot of information about guaranteed atomic operations (page 2210). In particular, Intel guarantees that its memory subsystems will be atomic for native types (bit, byte, integers of various sizes, float). These objects must be aligned so as to fit within a cache line (64 bytes on the current Intel platforms), not crossing a cache line boundary. But then Intel guarantees that no matter what device is using the memory bus, stores and fetches will be atomic.
For more complex objects, locking is required if you want to be sure you will get a safe execution. Further, if you are doing multicore operations you have to use the locked (atomic) variants of the Intel instructions to be sure of coherency for concurrent writes. You get this automatically for variables marked volatile in C++ or C# (Java too?).
What this adds up to is that local writes to native types can be paired with remotely initiated RDMA reads safely.
But notice that strings, byte arrays -- those would not be atomic because they could easily cross a cache line. Also, operations on complex objects with more than one data field might not be atomic -- for such things you would need a more complex approach, such as the one in the FaRM paper (Fast Remote Memory) by MSR. My own need is simpler and won't require the elaborate version numbering scheme FaRM implements...
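To illustrate the "aligned native type within one cache line" condition, here is a hedged user-space sketch; the 64-byte cache line size and the struct layout are assumptions about current x86-64 parts, not something the RDMA verbs API itself specifies:

    #include <stdalign.h>
    #include <stdatomic.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed cache-line size on current x86-64 parts */

    /* The whole object fits in, and is aligned to, one cache line, so an
     * aligned 8-byte store of 'value' cannot be torn by the memory system. */
    struct slot {
        alignas(CACHE_LINE) _Atomic uint64_t value;
        char pad[CACHE_LINE - sizeof(uint64_t)];
    };

    static struct slot table[1024];   /* region later exposed to one-sided RDMA reads */

    static void writer_update(unsigned i, uint64_t v)
    {
        /* Local atomic write; per the reasoning above, a remote one-sided RDMA
         * read covering this cache line should then observe either the old or
         * the new value, never a mix. */
        atomic_store_explicit(&table[i].value, v, memory_order_release);
    }

    int main(void)
    {
        writer_update(0, 42);
        return 0;
    }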
The cache coherence protocol implemented in the PCIe controller should guarantee atomicity for single cache line RDMA reads. The PCIe controller has to snoop the caches of CPU cores and take ownership of the cache line (RFO) before returning data to the RDMA adapter. So it should see some snapshot of the cache line.
I don't know of any such guarantee of atomicity. Of course RDMA reads are executed by the remote adapter, and cache line size is a CPU concept. I don't believe anything ensures that the granularity of reads used by the remote RDMA adapter matches the size of writes performed by the remote CPU.
In practice it is likely to work, since the remote adapter will probably issue a single PCI transaction and so on, but I don't think there is anything architectural that guarantees you don't get "torn" data.
I'm reading LDD3 and messing around with the kernel source code. Currently, I'm trying to fully understand the struct bio and its usage.
What I have read so far:
https://lwn.net/images/pdf/LDD3/ch16.pdf
http://www.makelinux.net/books/lkd2/ch13lev1sec3
https://lwn.net/Articles/26404/
(a part of) https://www.kernel.org/doc/Documentation/block/biodoc.txt
If I understand correctly, a struct bio describes a request for some blocks to be transferred between a block device and system memory. The rules are that a single struct bio can only refer to a contiguous set of disk sectors, but system memory can be non-contiguous and is represented by a vector of <page,len,offset> tuples, right? That is, a single struct bio requests the reading/writing of bio_sectors(bio) (some number of) sectors, starting with sector bio->bi_sector. The size of the data transferred is limited by the actual device, the device driver, and/or the host adapter. I can get that limit with queue_max_hw_sectors(request_queue), right? So, if I keep submitting bios that turn out to be contiguous in disk sectors, the I/O scheduler/elevator will merge these bios into a single one, until that limit is reached, right?
Also, bio->bi_size must be a multiple of 512 (or the equivalent sector size) so that bio_sectors(bio) is a whole number, right?
Moreover, these bio_sectors(bio) sectors will be moved to/from system memory, and by memory we mean struct pages. Since there is no specific mapping between <page,len,offset> and disk sectors, I assume that the bio->bi_io_vec entries are implicitly serviced in order of appearance. That is, the first disk sectors (starting at bio->bi_sector) will be written from / read to bio->bi_io_vec[0].bv_page, then bio->bi_io_vec[1].bv_page, etc. Is that right? If so, should bio_vec->bv_len always be a multiple of the sector size, or 512? Since a page is usually 4096 bytes, should bv_offset be exactly one of {0,512,1024,1536,...,3584,4096}? I mean, does it make sense, for example, to request 100 bytes to be written on a page starting at offset 200?
Also, what is the meaning of bio.bi_phys_segments and why does it differ from bio.bi_vcnt? bi_phys_segments is defined as "The number of physical segments contained within this BIO". Isn't a <page,len,offset> triple what we call a 'physical segment'?
Lastly, if a struct bio is so complex and powerful, why do we create lists of struct bio, name them struct request, and queue those requests in the request_queue? Why not have a bio_queue for the block device where each struct bio is stored until it is serviced?
I'm a bit confused so any answers or pointers to Documentation will be more than useful! Thank you in advance :)
what is the meaning of bio.bi_phys_segments?
The generic block layer can merge different segments. When the chunks of disk data are adjacent on the disk and their page frames are contiguous in memory, the merge operation creates a larger memory area, which is called a physical segment.
Then what is bi_hw_segments?
Yet another merge operation is allowed on architectures that handle the mapping between bus addresses and physical addresses through dedicated bus circuitry. The memory area resulting from this kind of merge operation is called a hardware segment. On the 80x86 architecture, which has no such dynamic mapping between bus addresses and physical addresses, hardware segments always coincide with physical segments.
That is, the first disk sectors (starting at bio->bi_sector) will be written from / read to bio->bi_io_vec[0].bv_page, then bio->bi_io_vec[1].bv_page, etc.
Is that right? If so, should bio_vec->bv_len always be a multiple of the sector size, or 512? Since a page is usually 4096 bytes, should bv_offset be exactly one of {0,512,1024,1536,...,3584,4096}? I mean, does it make sense, for example, to request 100 bytes to be written on a page starting at offset 200?
The bi_io_vec array contains the page frames for the I/O; bv_offset is the offset within the page frame. Before anything is actually read from or written to the disk, everything is mapped to sectors, since the disk deals in sectors. This doesn't imply that the length has to be a multiple of the sector size, so the result can be unaligned reads/writes, which are taken care of by the underlying device driver.
if a struct bio is so complex and powerful, why do we create lists of struct bio, name them struct request, and queue those requests in the request_queue? Why not have a bio_queue for the block device where each struct bio is stored until it is serviced?
The request queue is a per-device structure and takes care of flushing; every block device has its own request queue. The bio structure, on the other hand, is a generic entity for I/O. If you folded the request_queue features into bio, you would end up with a single global bio_queue, and a very heavy structure at that. Not a good idea. So these two structures serve different purposes in the context of an I/O operation.
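As a side note, here is a hedged sketch of how a driver might build and submit a single bio using the older (2.6/3.x-era) interfaces that LDD3 and this question refer to; the helper name and its arguments are made up, and current kernels have changed several of these signatures (e.g. bio_alloc() and submit_bio() now take different arguments):

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /* Assumed context: 'bdev', 'page' and 'sector' come from the caller. */
    static int submit_one_page_read(struct block_device *bdev,
                                    struct page *page, sector_t sector)
    {
        struct bio *bio;

        bio = bio_alloc(GFP_KERNEL, 1);          /* room for one bio_vec */
        if (!bio)
            return -ENOMEM;

        bio->bi_bdev   = bdev;                   /* target device   */
        bio->bi_sector = sector;                 /* starting sector */

        /* Add one segment: the whole page, starting at offset 0. The block
         * layer fills bi_io_vec[0] = { page, PAGE_SIZE, 0 } for us. */
        if (!bio_add_page(bio, page, PAGE_SIZE, 0)) {
            bio_put(bio);
            return -EIO;
        }

        /* A real driver would also set bio->bi_end_io (and ->bi_private)
         * to learn about completion. */
        submit_bio(READ, bio);                   /* older two-argument form */
        return 0;
    }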
Hope it helps.
I'm in the middle of writing a framebuffer driver for an SPI connected LCD. I use kmalloc to allocate the buffer, which is quite large - 150KB. Given the way kmalloc is allocating the buffer, ksize reports that way more memory is being used - 256KB or so.
The SPI spi_transfer structure takes pointers to tx and rx buffers, both of which have to be DMA safe. As I want the tx buffer to be about 16KB, can I allocate that buffer within the kmalloced video buffer and still be DMA safe?
This could be considered premature optimisation but there's so much spare space within the video buffer it feels bad not to use it! Essentially there is no difference in allocated memory between:
kmalloc(videosize)
and
kmalloc(PAGE_ALIGN(videosize) + txbufsize)
so one could take the kptr returned and do:
txbuf = (u8 *)kptr + PAGE_ALIGN(videosize);
I'm aware that part of the requirement of "DMA safe" is appropriate alignment - to CPU cacheline size I believe... - but shouldn't a page alignment be ok for this?
As an aside, I'm not sure if tx and rx can point to the same place. The spi.h header is unclear too (explicitly unclear actually). Given that the rx buffer will never be more than a few bytes, it would be silly to make trouble by trying to find out!
The answer appears to be yes with provisos. (Specifically that "it's more complicated than that")
If you acquire your memory via __get_free_page*() or the generic memory allocator (kmalloc) then you may DMA to/from that memory using the addresses returned from those routines. The underlying implication is that a page aligned buffer within kmalloc, even spanning multiple pages, will be DMA safe as the underlying physical memory is guaranteed to be contiguous and a page aligned buffer is guaranteed to be on a cache line boundary.
One proviso is whether the device is capable of driving the full bus width (eg: ISA). Thus, the physical address of the memory must be within the dma_mask of the device.
Another is cache coherency requirements. These operate at the granularity of the cache line width. To prevent two separate memory regions from sharing one cache line, the memory used for DMA must begin exactly on a cache line boundary and end exactly on one. Given that the cache line size may not be known, it is recommended (see the DMA API documentation) to only map virtual regions that begin and end on page boundaries (as these are guaranteed also to be cache line boundaries, as stated above).
A DMA driver can use dma_alloc_coherent() to allocate DMA-able space; in this case it guarantees that the DMA region is uncacheable. As this may be expensive, a streaming method also exists - for one-way communication - where coherency is limited to cache flushes on write: use dma_map_single() on a previously allocated buffer.
In my case, passing the tx and rx buffers to spi_sync without dma_map_single is fine - the spi routines will do it for me. I could use dma_map_single myself along with either unmap or dma_sync_single_for_cpu() to keep everything in sync. I won't bother at the moment though - performance tweaking after the driver works is a better strategy.
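For what it's worth, a hedged sketch of the layout and the streaming mapping discussed above (the sizes and names are placeholders taken from the question; a real driver would use its own struct device and respect whatever the SPI core requires):

    #include <linux/dma-mapping.h>
    #include <linux/mm.h>
    #include <linux/slab.h>

    /* Placeholder sizes from the question: ~150 KiB framebuffer, 16 KiB tx buffer. */
    #define VIDEO_SIZE  (150 * 1024)
    #define TXBUF_SIZE  (16 * 1024)

    static int map_tx_from_video(struct device *dev)
    {
        void *kptr, *txbuf;
        dma_addr_t tx_dma;

        /* One kmalloc covers both: the video buffer, then a page-aligned
         * (hence cache-line-aligned) tx buffer carved out of the same block. */
        kptr = kmalloc(PAGE_ALIGN(VIDEO_SIZE) + TXBUF_SIZE, GFP_KERNEL);
        if (!kptr)
            return -ENOMEM;

        txbuf = (u8 *)kptr + PAGE_ALIGN(VIDEO_SIZE);

        /* Streaming mapping: hand the buffer to the device for one transfer. */
        tx_dma = dma_map_single(dev, txbuf, TXBUF_SIZE, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, tx_dma)) {
            kfree(kptr);
            return -ENOMEM;
        }

        /* ... start the SPI/DMA transfer using tx_dma ... */

        dma_unmap_single(dev, tx_dma, TXBUF_SIZE, DMA_TO_DEVICE);
        kfree(kptr);
        return 0;
    }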
See also:
Does every dma_map_single call require a corresponding dma_unmap_single?
Linux kernel device driver to DMA into kernel space
When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research &, across a few sources, uncovered the following reasons:
1) makes better use of the cache & achieves lower avg memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! Will help me to better understand the kernel...
The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space for page table entries and fewer TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that are only logically linear can be requested, using the vmalloc() interface rather than kmalloc().
On 64-bit systems the kernel's mapping can encompass the entirety of physical memory; on 32-bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
Actually the memory allocation behavior you describe is common to many OS kernels, and the main reason is the kernel's physical page allocator. Typically the kernel has one physical page allocator that is used to allocate pages for both kernel space (including pages for DMA) and user space. In kernel space you need contiguous memory, because it is expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, such re-mapping would be completely pointless because the kernel can see the whole address space (on 32-bit systems there is the 4 GB limit on virtual address space, so typically the top 1 GB is dedicated to the kernel and the bottom 3 GB to user space).
The Linux kernel uses the buddy algorithm for page allocation, so allocating a bigger chunk takes fewer iterations than allocating a smaller chunk (smaller chunks are obtained by splitting bigger chunks). Moreover, using one allocator for both kernel space and user space lets the kernel reduce fragmentation. Imagine that you allocated pages for user space one page at a time: if user space needs N pages, you make N iterations. What happens if the kernel then wants some contiguous memory? How could it build a big enough contiguous chunk if you had taken one page out of every big chunk and given it to user space?
[update]
Actually, the kernel allocates contiguous blocks of memory for user space less frequently than you might think. Sure, it allocates them when it builds the ELF image of a file, when it does readahead as a user process reads a file, for IPC operations (pipes, socket buffers), or when the user passes the MAP_POPULATE flag to the mmap syscall. But typically the kernel uses a "lazy" page-loading scheme: it gives a contiguous range of virtual memory to user space (when the user first calls malloc or mmap), but it doesn't back that range with physical pages. It allocates pages only when a page fault occurs. The same is true when a user process forks: the child process gets a "read-only" address space, and when the child modifies some data, a page fault occurs and the kernel replaces the page in the child's address space with a new one (so that parent and child now have different pages). Typically the kernel allocates only one page in these cases.
Of course there is the big question of memory fragmentation. Kernel space always needs contiguous memory. If the kernel allocated pages for user space from "random" physical locations, it would be much harder to get a big chunk of contiguous memory in the kernel after some time (for example after a week of system uptime); memory would be too fragmented.
To mitigate this, the kernel uses a "readahead" scheme: when a page fault occurs in the address space of some process, the kernel allocates and maps more than one page (because there is a good chance the process will read/write data on the next page), and it uses a physically contiguous block of memory for this if possible, just to reduce potential fragmentation.
A couple that I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages' worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so; some older DMA controllers even require that memory to be located at low physical addresses (a sketch of requesting such a block follows this list).
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory. This reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4MB page, you're going to need 4MB of contiguous physical memory to back it.
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.
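As a small illustration of the first point (see the note in that item), this is roughly how a driver might ask the kernel for a physically contiguous multi-page block; the function names, order and flags here are illustrative only:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Illustrative: a 16 KiB (order-2, with 4 KiB pages) physically contiguous
     * block such as a DMA engine might require. GFP_DMA restricts it to the
     * low addresses some older controllers need. */
    static unsigned long dma_area;

    static int grab_contiguous_pages(void)
    {
        dma_area = __get_free_pages(GFP_KERNEL | GFP_DMA, 2);
        if (!dma_area)
            return -ENOMEM;

        /* virt_to_phys()/dma_map_*() would be used to hand the physical
         * address to the device. */
        return 0;
    }

    static void release_contiguous_pages(void)
    {
        free_pages(dma_area, 2);
    }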
Whether you request contiguous or non-contiguous memory from the kernel depends on your application.
E.g. contiguous memory allocation: if you need a DMA operation to be performed, you request contiguous memory through a kmalloc() call, because DMA requires memory that is also physically contiguous; with DMA you provide only the starting address of the memory chunk and the other device reads from or writes to that location.
Some operations do not require contiguous memory, so you can request a memory chunk through vmalloc(), which returns a pointer to memory that is not physically contiguous.
So it is entirely dependent on the application which is requesting the memory.
Please remember that it is good practice to request contiguous memory only when you really need it, because the kernel tries its best to hand out memory that is physically contiguous. And of course both kmalloc() and vmalloc() have their limits.
Placing things we are going to read a lot physically close together takes advantage of spatial locality: the things we need are more likely to be cached.
Not sure about this one
I believe this means that if the pages are not contiguous, the TLB has to do more work to find out where they all are. If they are contiguous, we can express all the pages of a process as PAGES_START + PAGE_OFFSET; if they aren't, we need to store a separate index for each of the process's pages. Because the TLB has a finite size and we need to access more data, this means we will be swapping entries in and out a lot more.
The kernel does not strictly need physically contiguous pages; it wants them for efficiency and stability.
A monolithic kernel tends to have one page table for kernel space, shared among processes, and it does not want page faults on kernel space, because handling them would make the kernel design too complex. That is why the usual implementation on 32-bit architectures is the 3G/1G split of the 4 GB address space: within the 1 GB of kernel space, ordinary mappings of code and data should not generate recursive page faults, which would be too complex to manage. You would need to find empty page frames, create mappings in the MMU, and handle TLB flushes for the new mappings on every kernel-side page fault, while the kernel is already busy handling user-side page faults.
Furthermore, a 1:1 linear mapping can get by with far fewer page table entries because it can use a bigger page size (>4 KB), and fewer entries lead to fewer TLB misses.
So the buddy allocator on the kernel's linear address space always provides physically contiguous page frames, even though most code doesn't need contiguous frames. On top of that, many device drivers that do need contiguous page frames already assume that buffers allocated through the general kernel allocator are physically contiguous.