linux kernel, struct bio: how pages are read/written

I'm reading LDD3 and messing around with the kernel source code. Currently, I'm trying to fully understand struct bio and its usage.
What I have read so far:
https://lwn.net/images/pdf/LDD3/ch16.pdf
http://www.makelinux.net/books/lkd2/ch13lev1sec3
https://lwn.net/Articles/26404/
(a part of) https://www.kernel.org/doc/Documentation/block/biodoc.txt
If I understand correctly, a struct bio describes a request for some blocks to be transferred between a block device and system memory. The rule is that a single struct bio can only refer to a contiguous set of disk sectors, but the system memory can be non-contiguous and is represented by a vector of <page, len, offset> tuples, right? That is, a single struct bio requests the reading/writing of bio_sectors(bio) sectors, starting at sector bio->bi_sector. The size of data transferred is limited by the actual device, the device driver, and/or the host adapter, and I can get that limit with queue_max_hw_sectors(request_queue), right? So, if I keep submitting bios that turn out to be contiguous in disk sectors, the I/O scheduler/elevator will merge these bios into a single one, until that limit is reached, right?
Also, bio->bi_size must be a multiple of 512 (or the equivalent sector size) so that bio_sectors(bio) is a whole number, right?
Moreover, these bio_sectors(bio) sectors will be moved to/from system memory, and by memory we mean struct pages. Since there is no explicit mapping between <page, len, offset> and disk sectors, I assume that the bio->bi_io_vec entries are implicitly serviced in order of appearance. That is, the first disk sectors (starting at bio->bi_sector) will be written from / read to bio->bi_io_vec[0].bv_page, then bio->bi_io_vec[1].bv_page, etc. Is that right? If so, should bio_vec->bv_len always be a multiple of the sector size, or 512? Since a page is usually 4096 bytes, should bv_offset be exactly one of {0, 512, 1024, 1536, ..., 3584}? I mean, does it make sense, for example, to request 100 bytes to be written to a page starting at offset 200?
Also, what is the meaning of bio.bi_phys_segments, and why does it differ from bio.bi_vcnt? bi_phys_segments is defined as "the number of physical segments contained within this BIO". Isn't a triple <page, len, offset> what we call a 'physical segment'?
Lastly, if a struct bio is so complex and powerful, why do we create lists of struct bio, name them struct request, and queue these requests in the request_queue? Why not have a bio_queue for the block device where each struct bio is stored until it is serviced?
I'm a bit confused so any answers or pointers to Documentation will be more than useful! Thank you in advance :)

What is the meaning of bio.bi_phys_segments?
The generic block layer can merge different segments. When the page frames in memory and the chunks of disk data that are adjacent on the disk are contiguous, the resulting merge operation creates a larger memory area, which is called a physical segment.
Then what is bi_hw_segments?
Yet another merge operation is allowed on architectures that handle the mapping between bus addresses and physical addresses through dedicated bus circuitry. The memory area resulting from this kind of merge operation is called a hardware segment. On the 80x86 architecture, which has no such dynamic mapping between bus addresses and physical addresses, hardware segments always coincide with physical segments.
That is, the first disk sectors (starting at bio->bi_sector) will be written from / read to bio->bi_io_vec[0].bv_page, then bio->bi_io_vec[1].bv_page, etc.
Is that right? If so, should bio_vec->bv_len always be a multiple of the sector size, or 512? Since a page is usually 4096 bytes, should bv_offset be exactly one of {0, 512, 1024, 1536, ..., 3584}? I mean, does it make sense, for example, to request 100 bytes to be written to a page starting at offset 200?
The bi_io_vec array holds the page frames for the I/O, and bv_offset is the offset within each page frame. Before the actual read/write hits the disk, everything is mapped to sectors, because the disk deals in sectors. That doesn't mean the length has to be a multiple of the sector size; it can result in unaligned reads/writes, which are taken care of by the underlying device driver.
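As an illustration, here is a minimal sketch (written against the pre-3.14 bio API that LDD3 describes, where bio_for_each_segment() takes an integer index and the starting sector is bio->bi_sector) showing how the segments are consumed in array order; my_show_bio is a made-up name:

    #include <linux/kernel.h>
    #include <linux/bio.h>

    /* Sketch only: walk a bio and show how each <page, len, offset> entry
     * covers the next bv_len bytes of the request, i.e. the segments are
     * serviced in order, starting at bio->bi_sector.
     */
    static void my_show_bio(struct bio *bio)
    {
            struct bio_vec *bvec;
            sector_t sector = bio->bi_sector;   /* bio->bi_iter.bi_sector on >= 3.14 */
            int i;

            bio_for_each_segment(bvec, bio, i) {
                    pr_info("seg %d: page %p, offset %u, len %u -> sector %llu\n",
                            i, bvec->bv_page, bvec->bv_offset, bvec->bv_len,
                            (unsigned long long)sector);
                    sector += bvec->bv_len >> 9;  /* bv_len need not be page-sized */
            }
    }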
If a struct bio is so complex and powerful, why do we create lists of struct bio, name them struct request, and queue these requests in the request_queue? Why not have a bio_queue for the block device where each struct bio is stored until it is serviced?
The request queue is a per-device structure and takes care of things like flushing; every block device has its own request queue. The bio structure, on the other hand, is the generic entity for an I/O. If you folded the request_queue features into the bio, you would end up with a single global bio_queue, and a very heavy structure at that, which is not a good idea. So the two structures serve different purposes in the context of an I/O operation.
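To make the relationship concrete, here is a rough sketch (request-based driver, older API; my_process_request is a made-up name) showing that a struct request is essentially a chain of merged bios plus the per-device bookkeeping:

    #include <linux/kernel.h>
    #include <linux/blkdev.h>

    /* Sketch only: a struct request carries a list of bios linked via bi_next. */
    static void my_process_request(struct request *rq)
    {
            struct bio *bio;
            unsigned int nr_bios = 0;

            __rq_for_each_bio(bio, rq)
                    nr_bios++;

            pr_info("request %p: %u bios, %u sectors in total\n",
                    rq, nr_bios, blk_rq_sectors(rq));
    }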
Hope it helps.

Related

VirtualAlloc a writeable, unbacked "throw-away" garbage range?

Is it possible in Win32 to get a writeable (or write-only) range of "garbage" virtual address space (i.e., via VirtualAlloc, VirtualAlloc2, VirtualAllocEx, or another API) that never needs to be persisted, and thus ideally is never backed by physical memory or the pagefile?
This would be a "hole" in memory.
The scenario is simulating a dry run of a sequential memory-writing operation just to obtain the size it actually consumes. You would be able to use the exact same code used for actual writing, but instead pass in an un-backed "garbage" address range that essentially ignores or discards anything written to it. In this example, the size of the "void" address range could be 2⁶⁴ bytes = 18.4 EB (why not? it's nothing, after all), and all you're interested in is the final value of an advancing pointer.
[Edit:] See the comments for the cleverest answer, namely: map a single 4K page multiple times in sequence, tiling the entire "empty" range.
This isn't possible. If you have code that attempts to write to memory, then the virtual memory needs to be backed by something.
However, if you modified your code to use a stream pattern, you could provide a stream implementation that ignores the writes and just tracks the size.
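For example, a sketch of such a counting-only stream in C (the names are made up; plug it in wherever your code currently takes a write callback):

    #include <stddef.h>

    /* Sketch only: a sink that discards the data and just tracks the size. */
    struct counting_sink {
        size_t total;
    };

    static size_t counting_write(struct counting_sink *sink,
                                 const void *data, size_t len)
    {
        (void)data;            /* nothing is stored or backed by memory */
        sink->total += len;    /* only the advancing "pointer" is kept  */
        return len;
    }

Run the serialization once against counting_write to learn the size, then allocate exactly that much and run it again with a real writer.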

map a buffer from Kernel to User space allocated by another module

I am developing a Linux kernel driver on 3.4. The purpose of this driver is to provide an mmap interface to user space for a buffer allocated in another kernel module, likely using kzalloc() (more details below). The pointer provided by mmap must point to the first address of this buffer.
I get the physical address from virt_to_phys(). I give this address right shifted by PAGE_SHIFT to remap_pfn_range() in my mmap fops call.
It is working for now, but it looks to me like I am not doing things properly, because nothing ensures that my buffer is at the start of a page (correct me if I am wrong). Maybe mmap()ing is not the right solution? I have already read chapter 15 of LDD3, but maybe I am missing something?
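For reference, this is roughly what my mmap fop looks like today (simplified sketch; my_buf stands for the pointer I get back from rproc_da_to_va()):

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <asm/io.h>

    static void *my_buf;    /* the kernel-virtual address from rproc_da_to_va() */

    static int my_mmap(struct file *filp, struct vm_area_struct *vma)
    {
            unsigned long size = vma->vm_end - vma->vm_start;
            unsigned long pfn  = virt_to_phys(my_buf) >> PAGE_SHIFT;

            if (remap_pfn_range(vma, vma->vm_start, pfn, size,
                                vma->vm_page_prot))
                    return -EAGAIN;
            return 0;
    }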
Details:
The buffer is in fact a shared memory region allocated by the remoteproc module. This region is used within an asymmetric multiprocessing design (OMAP4). I can get at this buffer thanks to the rproc_da_to_va() call. That is why there is no way to use something like get_free_pages().
Yes, you're correct: there is no guarantee that the allocated memory is at the beginning of a page, and there is no simple way for you to both guarantee that and make it truly shared memory.
Obviously you could (a) copy the data from the kzalloc'd address to a newly allocated page and insert that into the mmap'ing process' virtual address space, but then it's not shared with the original datum created by the other kernel module.
You could also (b) map the actual page being allocated by the other module into the process' memory map but it's not guaranteed to be on a page boundary and you'd also be sharing whatever other kernel data happened to reside in that page (which is both a security issue and a potential source of kernel data corruption by the user-space process into which you're sharing the page).
I suppose you could (c) modify the memory manager to return every piece of allocated data at the beginning of a page. This would work, but then every time a driver wants to allocate 12 bytes for some small structure, it will in fact be allocating 4K bytes (or whatever your page size is). That's going to waste a huge amount of memory.
There is simply no way to trick the processor into making the memory appear to be at two different offsets within a page. It's not physically possible.
Your best bet is probably to (d) modify the other driver to allocate the specific bits of data that you want to make shared in a way that ensures alignment on a page boundary (i.e. something you write to replace kzalloc).
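For option (d), a minimal sketch of what the replacement allocation could look like in the other module (my_alloc_shared is a made-up name; free with free_pages()):

    #include <linux/gfp.h>
    #include <linux/types.h>
    #include <asm/page.h>

    /* Sketch only: zeroed, page-aligned allocation for the region to be shared. */
    static void *my_alloc_shared(size_t size)
    {
            return (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
                                            get_order(size));
    }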

The bio structure in the Linux kernel

I am reading Linux Kernel Development by Robert Love. I don't understand this paragraph about the bio structure:
The basic container for block I/O within the kernel is the bio structure, which is defined in <linux/bio.h>. This structure represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. By allowing the buffers to be described in chunks, the bio structure provides the capability for the kernel to perform block I/O operations of even a single buffer from multiple locations in memory. Vector I/O such as this is called scatter-gather I/O.
What exactly does "in flight (active)" mean?
"As a list of segments" -- are we talking about this segmentation?
What does "By allowing the buffers ... in memory" mean?
Block devices are devices that deal in chunks (512, 1024 bytes) of data during an I/O transaction. struct bio is the structure the kernel provides for block I/O operations; it is commonly used in block device driver development.
Q1) What exactly does "in flight (active)" mean?
Block devices usually host a file system meant for storing files. Whenever a user-space application initiates a file I/O operation (read, write), the kernel in turn initiates a sequence of block I/O operations through the file-system layer. struct bio keeps track of the block I/O transactions (initiated by the user app) that are still to be processed; that is what is meant here by in flight/active.
"Q2) As a list of segments" -- are we talking about this segmentation?
The kernel needs memory buffers to hold the data going to/from the block device.
In the kernel there are two possibilities for how this memory can be allocated:
Virtually contiguous and physically contiguous (using kmalloc(): gives good performance but is limited in size)
Virtually contiguous but physically non-contiguous (using vmalloc(): for large memory-size requirements)
Here a segment means the first kind, i.e. physically contiguous memory used for the block I/O transfer. A list of segments is a set of such physically contiguous regions; note that the elements of the list need not be contiguous with one another.
Q3) What does "By allowing the buffers ... in memory" mean?
Scatter-gather is a feature that allows data to be transferred between a device and multiple non-contiguous memory locations in a single shot (one read/write transaction). Here struct bio keeps a record of the multiple segments that are to be processed: each segment is a contiguous memory region, while the segments are non-contiguous with one another. struct bio is what gives the kernel the capability to perform scatter-gather I/O.
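A rough sketch of what that looks like in code (older bio API, matching the book's era; bdev, start_sector and pages are assumed to exist, and completion handling is omitted):

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /* Sketch only: one bio whose three segments live in unrelated pages but
     * target consecutive sectors on the device (scatter-gather in one shot).
     */
    static void my_write_three_pages(struct block_device *bdev,
                                     sector_t start_sector,
                                     struct page *pages[3])
    {
            struct bio *bio = bio_alloc(GFP_KERNEL, 3);  /* old two-arg form */
            int i;

            if (!bio)
                    return;

            bio->bi_bdev   = bdev;
            bio->bi_sector = start_sector;      /* bi_iter.bi_sector on >= 3.14 */

            for (i = 0; i < 3; i++)
                    bio_add_page(bio, pages[i], PAGE_SIZE, 0);

            submit_bio(WRITE, bio);             /* bi_end_io/error handling omitted */
    }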
"In flight" means an operation that has been requested, but hasn't been initiated yet.
"Segment" here means a range of memory to be read or written, a contiguous
piece of the data to be transferred as part of the operation.
"Scatter/gather I/O" is meant by scatter operations that take a contiguous range of data on disk and distributes pieces of it into memory, gather takes separate ranges of data in memory and writes them contiguously to disk. (Replace "disk" by some suitable device in the preceding.) Some I/O machinery is able to do this in one operation (and this is getting more common).
1) "In flight" means "in progress"
2) No
3) Not quite sure :)

Basic Memory Allocator Info needed- Hoard and SLAB

Recently I have been reading about memory allocators, like Hoard and SLAB. However, I didn't get a few things:
a. Are these allocators managing physical memory or virtual memory? {If (your answer is physical memory) please read point b, else read point c}
b. If they manage physical memory, then since both these allocators make use of per-CPU data structures, don't they end up handing out space from the same physical page to different processes? For example, consider that thread T1 starts on CPU C and requests an int. After this, T1 gets preempted, T2 starts executing, and it also asks for an int. Since our structures are per-CPU, won't we end up satisfying both requests from the same physical page?
c. And if they manage virtual memory, then why do we say that all the data structures are per-CPU? Shouldn't we rather say they are per-process, since on every context switch we would have to repopulate these data structures?

dma mapping which does not cross some power-of-2 boundary

I want to set up a DMA mapping for a memory buffer allocated outside my control. dma_map_single appears to be the right API to use, but my hardware has a restriction whereby the mapping must not cross some power-of-two boundary, say for example 1K. The buffer being mapped is always smaller than the boundary value but is otherwise variable in size. So it looks like DMA pools may not work, since they need a fixed size, even though the "allocation" part is sort of what I need.
Should I just keep calling dma_map_single, check whether the mapping meets my requirement, and release the mapping if it does not? Could the same mapping keep being returned, causing a never-ending search? If so, I could hang on to the unfit mappings until a fitting one is found and then release all the unfit mappings in one shot. These don't sound like good ideas, though.
Does anyone have other/better ideas?
Thanks.
If you can't guarantee that the buffer you are passed meets your criteria, you may need to allocate an auxiliary buffer and copy to/from that buffer before you DMA. On platforms without an IOMMU or other address translation hardware (eg classic x86, ARM, etc), the DMA mapping operation is really just converting to a physical address. So if you unmap and try again with the same buffer, you'll always get back the same DMA address.
On most (all?) other platforms that do have an IOMMU, the translation is still done on chunks >= PAGE_SIZE. In other words, if you're on a platform with 4K pages, and you do DMA mapping on a buffer at 0xABCDExxx, you'll always get a DMA address like 0xFGHIJxxx where the low part of the address "xxx" stays the same. (This is because the IOMMU works like a normal MMU and only looks up the page translation, and leaves the low 12 or whatever bits alone)
So in essentially all cases on all platforms, you can't use the DMA API to fix up the alignment of the buffer you get passed. As I said, I think the only alternative if the buffers you get passed in don't meet your alignment requirements is to use a bounce buffer. The DMA pool API is a fine way to allocate these bounce buffers -- it's no problem if you sometimes need a smaller buffer; it's fine to leave some of the memory you get back unused.
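For what it's worth, dma_pool_create() takes a boundary argument that expresses exactly this kind of restriction, so a 1K-boundary bounce-buffer pool could be set up roughly like this (sketch; dev, MY_MAX_LEN and the function names are placeholders):

    #include <linux/dmapool.h>
    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>
    #include <linux/errno.h>
    #include <linux/string.h>

    #define MY_MAX_LEN 512          /* upper bound on the buffers being mapped */

    static struct dma_pool *my_pool;

    static int my_setup(struct device *dev)
    {
            /* blocks of MY_MAX_LEN bytes that never cross a 1K boundary */
            my_pool = dma_pool_create("my-bounce", dev, MY_MAX_LEN, 4, 1024);
            return my_pool ? 0 : -ENOMEM;
    }

    /* Copy into a bounce buffer and hand back its DMA address (TX direction). */
    static void *my_map_bounce(const void *src, size_t len, dma_addr_t *handle)
    {
            void *buf = dma_pool_alloc(my_pool, GFP_KERNEL, handle);

            if (buf)
                    memcpy(buf, src, len);
            return buf;             /* later: dma_pool_free(my_pool, buf, *handle) */
    }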
