How to split bio into multiple bios? - linux-kernel

I want to create a block device that receives a bio with a request for n sectors and splits it into n bios of 1 sector each. I used bio_split, but it doesn't work and hits a BUG_ON.
Is there a function to do such a thing?
If not, can anyone help me write a function that does it?
A function that splits a bio into 4K bios would also be fine.

The bio_split() function only works for bios with a single page (when the bi_vcnt field is exactly 1).
To deal with bios with multiple pages - and I suspect you deal with these most of the time - you have to create new bios and set them up so that each contains only a single sector.
Tip: If the sector size is the same as the page size (currently 4K), and your block driver tells the kernel to supply no less than this size, then you only have to put each page of the incoming bio into a new bio. If the sector size is less than the page size, then the logic will be a bit more complicated.
Use bio_kmalloc() to allocate the new bios and copy the data onto their memory pages manually.
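Below is a minimal sketch of that approach, written against an older bio API (the era where bio_split() still required bi_vcnt == 1): it walks the bio_vec array of the incoming bio and submits one single-page child bio per segment. The field names bi_sector and bi_rw and the two-argument submit_bio() belong to that older API, and my_child_end_io is a hypothetical completion handler you would have to write yourself (it should complete the original bio once all children have finished); adjust the names to your kernel version.

static void split_into_page_bios(struct bio *orig)
{
        sector_t sector = orig->bi_sector;      /* bi_iter.bi_sector on newer kernels */
        int i;

        for (i = 0; i < orig->bi_vcnt; i++) {
                struct bio_vec *bv = &orig->bi_io_vec[i];
                struct bio *new = bio_kmalloc(GFP_NOIO, 1);     /* room for one page */

                if (!new) {
                        /* Real code must unwind the already-submitted children
                         * and fail the original bio here. */
                        return;
                }

                new->bi_bdev    = orig->bi_bdev;
                new->bi_sector  = sector;
                new->bi_rw      = orig->bi_rw;
                new->bi_end_io  = my_child_end_io;      /* hypothetical completion handler */
                new->bi_private = orig;

                bio_add_page(new, bv->bv_page, bv->bv_len, bv->bv_offset);

                submit_bio(new->bi_rw, new);    /* just submit_bio(new) on newer kernels */
                sector += bv->bv_len >> 9;      /* advance in 512-byte sectors */
        }
}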

Related

Using SetFilePointer to change the location to write in the sector doesn't work?

I'm using SetFilePointer to rewrite the second half of the MBR. It's a user-mode application and I opened a handle to PhysicalDrive.
At first I tried to set the size parameter in WriteFile to 256, but WriteFile returned INVALID_PARAMETER. Based on other questions here, it seems this is because we are forced to write in multiples of the sector size when the handle is to a PhysicalDrive, for some reason.
Then I tried to set the file pointer to 256 and write 512 bytes. Both calls return no error, but for some unknown reason the data is written from the beginning of the sector, as if SetFilePointer didn't work, even though SetFilePointer succeeds and returns 256.
So my questions are:
Why does the write size have to be a multiple of the sector size when the handle is to a PhysicalDrive? Which other device handles behave like this?
Why does WriteFile still write from the start of the sector when I set the file pointer to 256?
Isn't this really redundant? Even if I want to change 1 byte, I have to read the entire sector, change the one byte and then write it back, instead of just writing 1 byte; that seems like 10 times more overhead. Isn't there a faster way to write a few bytes within a sector?
I think you are mixing up the file system and the storage (block device). The file system sits above the storage device stack. If your code obtains a handle to a file system device, you can write byte by byte. But if you are accessing the storage device stack, you can only write sector by sector (or block by block).
Writing directly to a block device is indeed slow, as you discovered. However, in most cases people just talk to file systems. Most file system drivers maintain a cache and use read and write algorithms to improve performance.
I can't comment on the file-pointer-based offset without seeing the actual code, but I suspect it is either not sector aligned or not being used at all.
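Since raw PhysicalDrive I/O is sector-granular, the usual pattern is read-modify-write. A rough sketch of that pattern follows; it assumes a 512-byte sector (real code should query IOCTL_DISK_GET_DRIVE_GEOMETRY), needs administrator rights, and really will overwrite sector 0 (the MBR) if run, so treat it as an illustration only.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const DWORD sector_size = 512;              /* assumed; query the real geometry */
    DWORD n;
    LARGE_INTEGER pos;

    HANDLE h = CreateFileA("\\\\.\\PhysicalDrive0",
                           GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* VirtualAlloc returns page-aligned memory, which also satisfies
     * the sector-alignment requirement for the I/O buffer. */
    BYTE *buf = VirtualAlloc(NULL, sector_size, MEM_COMMIT, PAGE_READWRITE);
    if (!buf)
        return 1;

    pos.QuadPart = 0;                           /* sector 0 = the MBR */
    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);
    if (!ReadFile(h, buf, sector_size, &n, NULL) || n != sector_size)
        return 1;

    buf[256] ^= 0xFF;                           /* modify the byte(s) you care about */

    SetFilePointerEx(h, pos, NULL, FILE_BEGIN); /* seek back to the sector boundary */
    if (!WriteFile(h, buf, sector_size, &n, NULL) || n != sector_size)
        return 1;

    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return 0;
}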

What is the best way to reserve a very large virtual address space (TBs) in the kernel?

I'm trying to manually update the TLB to translate new virtual memory pages into a certain set of physical pages in kernel space. I want to do this in do_page_fault so that whenever a load/store instruction hits a particular virtual address range (not already mapped), a page table entry is placed in the TLB in advance. The translation is simple. For example, I would like the following piece of code to work properly:
int d;
/* Cast to an integer type before adding: &d + offset would scale the offset by sizeof(int). */
int *pd = (int *)((unsigned long)&d + 0xffff000000000000UL);
*pd = 25; // A page fault occurs and the TLB is updated
printk("d = %d\n", *pd); // No page fault (the TLB is already up to date)
So the translation is just a subtraction of 0xffff000000000000. I was wondering what the best way is to implement this TLB-update functionality?
Edit: The main reason for doing this is to be able to reserve a large virtual memory space in the kernel. I just want to handle page faults for a particular address range in the kernel. So, first I have to reserve the address range (which may exceed the 100 TB limit). Then, I have to implement a page cache for that particular range. If that is not possible, what is the best alternative?

kernel and user space sync

I have a memory area mapped to user space with do_mmap_pgoff() and remap_pfn_range(), and the same area mapped into the kernel with ioremap().
When I write to this area from user space and then read it from kernel space, I see that not all bytes were actually written to the memory area.
When I write from user space, then read from user space, and after that read from the kernel, everything is fine: the read from user space pushes the previously made changes out to memory.
I understand that a cache or buffer sits between kernel and user space, and that I need some flush/invalidate step or buffer write-back to the memory area.
I tried making the VMA uncached with pgprot_noncached(), and I tried an outer-cache range flush/invalidate, a VMA cache range flush, and a VMA TLB range flush, but none of it works as I expected. All the flush/invalidate operations just clear the memory area, whereas I need to apply the changes made from user space. Using uncached memory slows down the data transfer.
How do I do this synchronization between user space and the kernel correctly?
I have nearly the same question as you.
I use a shared memory region to pass data between kernel and user space. In the kernel, I use the physical address directly to access the data. In user space, I open /dev/mem and mmap it to read/write.
And here the problem comes: when I write data to address A from user space, the kernel may not receive the data, and may even overwrite the data at A with its previous value. I think the CPU cache may be causing this.
Here is my solution:
I open /dev/mem like this:
fd = open("/dev/mem", O_RDWR);
NOT this:
fd = open("/dev/mem", O_RDWR | O_SYNC);
And problem solved.
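For reference, a minimal user-space sketch of the approach described above: map a physical region through /dev/mem opened without O_SYNC, so accesses go through a normal (cacheable) mapping. PHYS_BASE and REGION_SIZE are placeholders for the shared buffer used by the kernel side.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PHYS_BASE   0x3f000000UL   /* hypothetical physical address of the shared buffer */
#define REGION_SIZE 4096UL

int main(void)
{
    int fd = open("/dev/mem", O_RDWR);          /* deliberately without O_SYNC */
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    volatile uint8_t *shm = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, PHYS_BASE);
    if (shm == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    shm[0] = 0xAB;                              /* value the kernel side should observe */
    printf("first byte now: 0x%02x\n", shm[0]);

    munmap((void *)shm, REGION_SIZE);
    close(fd);
    return 0;
}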

Memory mapping in Virtual Address Space (VAS)

This [wiki article] about Virtual memory says:
The process then starts executing bytes in the exe file. However, the only way the process can use or set '-' values in its VAS is to ask the OS to map them to bytes from a file. A common way to use VAS memory in this way is to map it to the page file.
A diagram follows:
0 4GB
VAS |---vvvvvvv----vvvvvv---vvvv----vv---v----vvv--|
mapping ||||||| |||||| |||| || | |||
file bytes app.exe kernel user system_page_file
I didn't understand the part "the only way the process can use or set '-' values in its VAS is to ask the OS to map them to bytes from a file".
What is the system page file here?
First off, I can't imagine such a badly written article existing on Wikipedia. One has to be an expert already familiar with the topic to understand what is being described.
Assuming you understand the rest of the article, the '-' parts represent unallocated virtual addresses within the 4 GB address space available to a process. So the sentence "the only way the process can use or set '-' values in its VAS is to ask the OS to map them to bytes from a file" means allocating virtual addresses, e.g. by a Windows native program calling VirtualAlloc(), or a C program calling malloc(), to obtain memory for program data that does not already exist in the current process's virtual address space.
When Windows allocates memory to a process address space, it normally associates that memory with the paging file on the hard disk. C:\pagefile.sys is this paging file, and it is the system_page_file mentioned in the article. Memory pages are swapped out to that file when there are not enough physical pages to accommodate the demand.
Hope that clarifies things.
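A tiny illustration of the VirtualAlloc() point above (a sketch, Windows-only): the '-' (unallocated) regions of the VAS become usable only after the process asks the OS to back them, and committed pages are backed by the paging file on demand.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Reserve 1 MB of address space; no storage is attached yet. */
    void *p = VirtualAlloc(NULL, 1 << 20, MEM_RESERVE, PAGE_NOACCESS);

    /* Commit it; the pages are now backed by the paging file on demand. */
    VirtualAlloc(p, 1 << 20, MEM_COMMIT, PAGE_READWRITE);

    ((char *)p)[0] = 42;    /* touching the page is now legal */
    printf("committed at %p\n", p);

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}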

Accessing user space data from linux kernel

This is an assignment problem which asks for a partial implementation of process checkpointing:
The test program allocates an array, makes a system call, and passes the start and end addresses of the array to the call. In the system call I have to save the contents of the given range to a file.
From my understanding, I could simply use copy_from_user() to save the contents of the given range. However, since the assignment is based on the topic "Process address space", I probably need to walk the page tables. Say I manage to get the struct pages that correspond to the given range. How do I get the data corresponding to those pages?
Can I just use page_to_virt() and access the data directly?
Since the array is contiguous in virtual address space, I guess I will just need to translate the starting address to a page and then back to a virtual address, and then copy the range's worth of data to the file. Is that right?
I think copy_from_user() is OK; nothing else is needed. When executing the system call, although it traps into kernel space, the context is still that of the process making the call, and the kernel still uses that process's page table. So just use copy_from_user() and nothing else is needed.
OK, if you do want to do the experiment, I think you can take the void __user *vaddr and traverse mm->pgd (the page table), using pgd_offset/pud_offset/pmd_offset/pte_offset to get the physical address of the page (page-size aligned). Then, in kernel space, use ioremap() to create a kernel mapping; using the kernel virtual address (page aligned) plus the offset inside the page, you get the starting virtual address of the array. Now, in the kernel, you can use that virtual address to access the array.
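A hedged sketch of the "copy_from_user() is enough" route described above: a system call that copies [start, end) from the calling process into a kernel buffer (writing that buffer to the checkpoint file is left out, and the name save_range is made up for the example). Note that kernels before 5.0 give access_ok() an extra type argument.

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>

SYSCALL_DEFINE2(save_range, void __user *, start, void __user *, end)
{
	size_t len = (char __user *)end - (char __user *)start;
	void *buf;

	if (!access_ok(start, len))             /* access_ok(VERIFY_READ, start, len) pre-5.0 */
		return -EFAULT;

	buf = kmalloc(len, GFP_KERNEL);         /* fine for a small array; consider kvmalloc for big ranges */
	if (!buf)
		return -ENOMEM;

	if (copy_from_user(buf, start, len)) {
		kfree(buf);
		return -EFAULT;
	}

	/* ... write 'buf' to the checkpoint file here ... */

	kfree(buf);
	return 0;
}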
