Linux Kernel - programmatically retrieve block numbers as they are written to - linux-kernel

I want to maintain a list of block numbers as they are physically written to using the linux kernel source. I plan to modify the kernel source to do this. I just need to find the structure and functions in the kernel source that handle writing to physical partitions and get the block numbers as they write to the physical partition.
Any way of doing this? Any help is appreciated. If I can find where the kernel is actually writing to the partitions and returning the block numbers, that'd work.

I believe you could do this entirely from userspace, without modifying the kernel, using the blktrace interface.
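As a rough sketch of that userspace route: blkparse (shipped with the blktrace tools) prints one event per line, and the function below picks completed writes out of that text. It assumes blkparse's default line layout (device, CPU, sequence, timestamp, PID, action, RWBS flags, then "sector + count"). The function name and the choice of the 'C' (complete) action are illustrative assumptions, not something the answer specifies.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of blkparse's default text output, e.g.
 *   "  8,0    3        1     0.000000000   697  C   W 223490 + 8 [0]"
 * Returns 1 and fills *sector / *count when the line describes a
 * completed ('C') write, 0 for anything else (reads, other actions,
 * summary lines). The outputs are only meaningful when 1 is returned.
 */
int parse_completed_write(const char *line,
                          unsigned long long *sector,
                          unsigned int *count)
{
    unsigned int major, minor, cpu, pid;
    unsigned long long seq;
    double timestamp;
    char action[8], rwbs[8];

    int fields = sscanf(line, "%u,%u %u %llu %lf %u %7s %7s %llu + %u",
                        &major, &minor, &cpu, &seq, &timestamp, &pid,
                        action, rwbs, sector, count);
    if (fields != 10)
        return 0;               /* not a per-event line with a sector */
    if (strcmp(action, "C") != 0)
        return 0;               /* only keep completions */
    return strchr(rwbs, 'W') != NULL;   /* 'W' in RWBS marks a write */
}
```

Fed line by line (for example with fgets()) from something like `blktrace -d /dev/sda -o - | blkparse -i -`, where the device name is only an example, each hit gives the starting sector of a write and the count printed after the "+".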

It isn't just one place to check. For instance, if the block device was an iSCSI or AoE target, you would be looking for their respective drivers, then ultimately the same on the other end.
The same would go for normal SCSI, misc flash devices, etc, minus network interaction.
VFS just pulls these all together in a convenient, unified and consistent interface so that calls like read() and write() work, while providing buffering. The actual magic, including ordering and write barriers, is handled by the block device drivers themselves.
When device mapper is in use, the path changes slightly: vfs -> dm_(target) -> blockdev_driver.

Related

Where are page permissions stored on hardware and how can I alter them directly?

I'm trying to write a pseudo kernel driver (it uses CVE-2018-8120 to get kernel permission, so it's technically not a driver) and I want to be as safe as possible when entering ring0. I'm writing a function to read and write MSRs from userland, and before the transition to ring0 I'm trying to guarantee that the void pointer given to my function can be written. I decided the ideal way to do this was to make it writable if it is not already.
The problem is that the only way I know how to do this is with VirtualProtect() and NtAllocateVirtualMemory, but VirtualProtect() sometimes fails and returns an error instead. I want to know precisely where these access permissions are stored (in RAM? in some special CPU register?), how I can obtain their address, and how I can modify them directly.
User-mode code should never try to muck around in kernel data structures, and any properly written kernel will prevent it anyway. The best way for user mode code to ensure that an address can be written is to write to it. If the page was not already writeable, the page fault will cause the kernel to make it so.
Nevertheless, the kernel code /cannot/ rely on the application having done so, for two reasons:
1) Even if the application does it properly, the page might be unmapped again before (or after) entering ring 0.
2) The kernel should /never/ rely on application code to do the right thing. It always has to protect itself.
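The access permissions are stored, together with the physical page addresses, in the entries of the paging structures (page directory and page tables), which live in RAM. CR3 holds the physical address of the top-level structure, and CR0 controls paging globally: its PG bit enables paging and its WP bit decides whether ring 0 code honors read-only pages.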
More information can be found here: https://wiki.osdev.org/Paging.
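To make the layout concrete, here is a small sketch that decodes the permission bits of a single page table entry. It assumes x86-64 long mode with 4 KiB pages (32-bit and PAE layouts differ in the upper bits), and the struct and function names are made up for illustration. It only interprets a value it is handed; reading or changing a live entry still requires ring 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of an x86-64 page table entry (4 KiB page). */
#define PTE_PRESENT   (1ULL << 0)    /* P:   page is mapped                */
#define PTE_WRITABLE  (1ULL << 1)    /* R/W: writes allowed                */
#define PTE_USER      (1ULL << 2)    /* U/S: accessible from ring 3        */
#define PTE_NX        (1ULL << 63)   /* XD:  instruction fetches forbidden */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL  /* bits 12..51: frame address */

/* Hypothetical helper type, not part of any OS API. */
struct pte_info {
    bool present;
    bool writable;
    bool user;
    bool no_execute;
    uint64_t frame_addr;   /* physical address of the 4 KiB frame */
};

/* Split a raw 64-bit entry into its permission flags and frame address. */
struct pte_info decode_pte(uint64_t pte)
{
    struct pte_info info;

    info.present    = (pte & PTE_PRESENT)  != 0;
    info.writable   = (pte & PTE_WRITABLE) != 0;
    info.user       = (pte & PTE_USER)     != 0;
    info.no_execute = (pte & PTE_NX)       != 0;
    info.frame_addr = pte & PTE_ADDR_MASK;
    return info;
}
```

Note that the effective permission of an address is the combination of the flags at every level of the walk, so a writable leaf entry under a read-only directory entry is still read-only.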

Providing a basic filesystem from a char driver

I have an existing Linux device driver that exposes a basic char device to userland. (I am not its original author, but I'm trying to modify it.)
Currently it provides a maze of ioctl functions to do various things (though also wrapped in a handy library so most user code doesn't need to deal with the details of it).
One of the things that it does is to provide a sub-stream interface, where given a bunch of device-specific identifying information (including a string and some numeric ids) it can read or write (but not both at once) some data (up to a small number of MB) in a strictly sequential manner. Currently it does this with explicit ioctls.
I'm wondering if there is a way to leverage the existing file_operations infrastructure or similar to provide either a virtual filesystem or just an ioctl that can return a new already-open fd that can then be used with read/write/close (but not lseek) from userland as you'd normally expect?
The device does have a concept of a filename (that's the string) but it is not possible to enumerate existing valid filenames (only to try to open a specific filename and see if it gives an error or not), and the filename is not sufficient to open a stream by itself, which is why I'm currently leaning more towards the "special open" ioctl on the parent device rather than trying to expose things directly in some userland-visible fs that can be opened directly. (Also there's no concept of subdirs and only basic write-protect permissions, so a full fs seems like overkill anyway.) But I'm willing to be persuaded otherwise if there's a better way to do it.
I have written basic char drivers from scratch myself before, so I'm reasonably confident that I can get the read/write ops and other supporting things to work; I'm just not sure how to best handle that initial step of opening the handle.
I'm currently targeting kernel 3.2+.
Edit: The main reason that I think making an actual filesystem (or trying to expose it via procfs or sysfs) wouldn't work is that there's no way to populate a directory -- the only ops available are "open for read" and "open for write", and there's no way to tell which names are valid prior to the open attempt (the files are stored in external hardware and accessed via a protocol I cannot change). If I'm missing something and it is possible to support this sort of thing, that would be useful to know as well.
You can most certainly create a file system where readdir() is not implemented, but the open() method is. It's normally not done because it's not particularly user-friendly, but it certainly is doable.
You're targeting really ancient kernels if you're looking at 3.2 -- the upstream kernel developers aren't even bothering to backport security fixes that far back, so I certainly wouldn't recommend shipping something as ancient as 3.2, but it's technically doable.
All you need to do is implement the lookup() method in the inode_operations structure for directories. You'll need to figure out some way of creating inodes with unique inode numbers that contain private information, so you can identify the substream. The inode will have a file_operations structure that implements the read/write methods for reading and writing the substream.
You can try looking at a simple file system such as cramfs or minix to see how things are done.

How should different Linux device tree drivers share common registers?

I'm working on a port of the Linux kernel to an unsupported ARM SoC platform. Unfortunately, on this SoC, different peripherals will sometimes share registers or commingle registers within the same region of memory. This is giving me grief with the Device Tree specification which doesn't seem to support the notion of different devices sharing the same set of registers or registers commingled in the same address space. Various documents I've read on the device tree don't suggest the proper way to handle this.
My simple approach to specify the same register region within multiple drivers throws "can't request region for resource" for the second device that attempts to map the same register region as another driver. From my understanding, this results from the kernel enforcing device tree rules regarding register regions.
What is the preferred general solution for solving this dilemma? Should there be a higher level driver that marshals access to the shared register region? Are there examples in the existing Linux kernel that address this specific issue (I couldn't find any, but I may not be sure what to look for)?
I am facing exactly the same problem. My solution is to create a separate module to guard common resources and then write 'client modules' that use symbols exported from the common module.
Note that this makes sense from the safety point of view as well. How would you otherwise implement proper memory locking and ensure operation coherency across several independent modules?
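The shape of such a guard module can be modeled in user space. This is only a sketch: the "register" is a plain variable standing in for an ioremapped address, the spinlock is built from C11 atomics rather than the kernel's spinlock_t, and the function names are hypothetical. In a real kernel module the two accessors would be the symbols exported to the client modules.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the shared hardware register (assumption: in a driver
 * this would be an ioremapped address accessed via readl()/writel()). */
static volatile uint32_t fake_shared_reg;

/* One lock owned by the common module, never by its clients. */
static atomic_flag shared_reg_lock = ATOMIC_FLAG_INIT;

/* Read-modify-write of the bits selected by mask, done under the lock
 * so two clients touching different bit fields cannot lose updates. */
void shared_reg_update_bits(uint32_t mask, uint32_t val)
{
    while (atomic_flag_test_and_set_explicit(&shared_reg_lock,
                                             memory_order_acquire))
        ;   /* spin until the lock is free */

    uint32_t v = fake_shared_reg;
    v = (v & ~mask) | (val & mask);
    fake_shared_reg = v;

    atomic_flag_clear_explicit(&shared_reg_lock, memory_order_release);
}

/* Plain read of the current register value. */
uint32_t shared_reg_read(void)
{
    return fake_shared_reg;
}
```

Each client then only ever calls shared_reg_update_bits() with the mask of the field it owns, and the mapping itself is requested once, by the common module.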
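You can still use devm_ioremap() directly, since unlike devm_ioremap_resource() it does not request the region and so does not fail when another driver has already mapped it, but you then have to take care of synchronizing access yourself.
Below is an example from upstream: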
https://github.com/torvalds/linux/blob/master/drivers/usb/phy/phy-tegra-usb.c#L1368

Detect write to DebugFS

I have a kernel module that creates several DebugFS entries, each 4 to 8 bytes. I would like to use one (or more) of these entries to initiate action within the kernel module--in other words, I want to use an entry for configuration purposes.
Is there a common idiom to detect the user write to the DebugFS entry without polling (some kind of user-space to kernel-space signal) within my kernel module, or is sleep/poll the best (only?) option?
Helper functions like debugfs_create_u32() are intended for cases where you want to be able to change a variable without any other helper code.
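If you want to do anything other than set a variable, you have to implement your own file operations and register them with debugfs_create_file(). Your write handler then runs whenever user space writes to the entry, so no polling is needed.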

make_request and queue limits

I'm writing a linux kernel module that emulates a block device.
There are various calls that can be used to tell the kernel the block size, so that it aligns and sizes every request to the driver accordingly. This is well documented in the "Linux Device Drivers 3" book.
The book describes two methods of implementing a block device: using a "request" function, or using a "make_request" function.
It is not clear whether the queue limit calls apply when using the minimalistic "make_request" approach (which is also the more efficient one if the underlying device really gets no benefit from sequential over random IO, which is the case for me).
I would really like to get the kernel to talk to me using 4K block sizes, but I see smaller bio-s hitting my make_request function.
My question is: should the blk_queue_limit_* calls affect the bio size when using make_request?
Thank you in advance.
I think I've found enough evidence in the kernel code that if you use make_request, you'll get correctly sized and aligned bios.
The answer is:
You must call blk_queue_make_request first, because it sets queue limits to defaults. After this, set queue limits as you'd like.
It seems that every part of the kernel that submits bios does check them for validity, and it's up to the submitter to do these checks. I've found only incomplete validation in submit_bio and generic_make_request. But as long as no one does tricks, it's fine.
Submitting correct bios is a policy, but it is left to the submitter to follow it and nothing in the middle enforces it, so I think I have to implement explicit checks and fail the wrong bios. Since it's a policy, it's fine to fail on a violation, and since the kernel does not enforce it, explicit checks are a good thing to have.
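The explicit check described here boils down to a little arithmetic. The following is a user-space sketch of it, not driver code: the function name is hypothetical, and the start sector and byte count are passed in as plain integers where a make_request function would take them from the bio (bi_sector and bi_size on kernels of that era). It assumes the block layer's usual 512-byte sector unit.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u   /* unit in which the block layer counts sectors */

/*
 * Returns true when a request starts on a block boundary and its length
 * is a whole number of blocks. block_size is in bytes and must itself be
 * a non-zero multiple of the sector size (e.g. 4096).
 */
bool request_is_block_aligned(uint64_t start_sector,
                              uint32_t size_bytes,
                              uint32_t block_size)
{
    if (block_size == 0 || block_size % SECTOR_SIZE != 0)
        return false;

    uint64_t sectors_per_block = block_size / SECTOR_SIZE;

    return start_sector % sectors_per_block == 0 &&
           size_bytes % block_size == 0;
}
```

With a 4096-byte block size, a request starting at sector 8 with 4096 bytes passes, while one starting at sector 1, or one carrying only 512 bytes, would be failed.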
If you want to read a bit more on the story, see http://tlfabian.blogspot.com/2012/01/linux-block-device-drivers-queue-and.html.

Resources