Which kernel function creates /dev/dm-x for LVM? - linux-kernel

My understanding is, at boot time, a block device (with lvm on its partitions) triggers a udev rule (in xx-lvm.rules) which runs pvscan to activate the lvm drives. pvscan calls the ioctl(fd, DM_DEV_CREATE_CMD,...) on /dev/mapper/control device file, which was created by module dm_mod.
The simplest lvm setup is nothing more than several 'linear' dm mapped devices concatenated together, therefore this ioctl() is fully handled by kernel device-mapper code, specifically in dm-ioctl.c.
In lookup_ioctl(), DM_DEV_CREATE_CMD is handled by dev_create(). However from dev_create(), I didn't find any code that actually calls device_register() or device_add() or devtmpfs_create_node() to make a new device and register it. What did I miss?

Related

How to access physical address during interrupt handler linux

I wrote an interrupt handler in linux.
Part of the handler logic I need to access physical address.
I used iormap function but then I fell into KDB during handler time.
I started to debug it and i saw the below code which finally called by ioremap
What should I do? Is there any other way instead of map the region before?
If i will need to map it before it means that i will probably need to map and cache a lot of unused area.
BTW what are the limits for ioremap?
Setting up a new memory mapping is an expensive operation, which typically requires calls to potentially blocking functions (e.g. grabbing locks). So your strategy has two problems:
Calling a blocking function is not possible in your context (there is no kernel thread associated with your interrupt handler, so there is no way for the kernel to resume it if it had to be put to sleep).
Setting up/tearing down a mapping per IRQ would be a bad idea performance-wise (even if we ignore the fact that it can't be done).
Typically, you would setup any mappings you need in a driver's probe() function (or in the module's init() if it's more of a singleton thing). This mapping is then kept in some private device data structure, which is passed as the last argument to some variant of request_irq(), so that the kernel then passes it back as the second argument to the IRQ handler.
Not sure what you mean by "need to map and cache a lot of unused area".
Depending on your particular system, you may end up consuming an entry in your CPU's MMU, or you may just re-use a broader mapping that was setup by whoever wrote the BSP. That's just the cost of doing business on a virtual memory system.
Caching is typically not enabled on I/O memory because of the many side-effects of both reads and writes. For the odd cases when you need it, you have to use ioremap_cached().

Linux kernel initialization - When are devicetree blobs parsed and tree nodes are loaded?

I would like to establish a milestone roadmap for Linux initialization for me to easily understand. (For an embedded system) Here is what I got:
Bootloader loads kernel to RAM and starts it
Linux kernel enters head.o, starts start_kernel()
CPU architecture is found, MMU is started.
setup_arch() is called, setting CPU up.
Kernel subsystems are loaded.
do_initcalls() is called and modules with *_initcall() and module_init() functions are started.
Then /sbin/init (or alike) is run.
I don't know when exactly devicetree is processed here. Is it when do_initcall() functions are beings processed or is it something prior to that?
In general when devicetree is parsed, and when tree nodes are processed?
Thank you very much in advance.
Any correction to my thoughts are highly appreciated.
It's a good question.
Firstly, I think you already know that the kernel will use data in the DT to identify the specific machine, in case of general use across different platform or hardware, we need it to establish in the early boot so that it has the opportunity to run machine-specific fixups.
Here is some information I digest from linux kernel documents.
In the majority of cases, the machine identity is irrelevant, and the kernel will instead select setup code based on the machine’s core CPU or SoC. On ARM for example, setup_arch() in arch/arm/kernel/setup.c will call setup_machine_fdt() in arch/arm/kernel/devtree.c which searches through the machine_desc table and selects the machine_desc which best matches the device tree data. It determines the best match by looking at the ‘compatible’ property in the root device tree node, and comparing it with the dt_compat list in struct machine_desc (which is defined in arch/arm/include/asm/mach/arch.h if you’re curious).
As for the Linux Initialization, I think there are something we can add in the list.
Put on START button, reset signal trigger
CS:IP fix to the BIOS 0XFFFF0 address
Jump to the start of BIOS
Self-check, start of hardware device like keyboard, real mode IDT & GDT
Load Bootloader like grub2 or syslinux.
Bootloader loads kernel to RAM and starts it (boot.img->core.img).
A20 Open, call setup.s, switch into protected mode
Linux kernel enters head.o, IDT & GDT refresh, decompress_kernel(), starts start_kernel()
INIT_TASK(init_task) create
trap_init()
CPU architecture is found, MMU is started (mmu_init()).
setup_arch() is called, setting CPU up.
Kernel subsystems are loaded.
do_initcalls() is called and modules with *_initcall() and module_init() functions are started.
rest_init() will create process 1 & 2, in other word, /sbin/init (or alike) and kthreadd is run.

Silently discard writes to mmap region

I have a Linux device driver which allows a userspace process to mmap() certain regions of the device's MMIO space for writing. The device may at some point decide to revoke access to the region, and will notify the driver when this happens. The driver (asynchronously) notifies the userspace process to stop using this region.
I'd like the driver to immediately zap the PTEs for this mapping so they can be returned to device control, however, the userspace process might still be finishing a write. I'd like to simply discard these writes. The user does not need to know which writes made it to the device and which writes were discarded. What can the driver's fault handler do after zapping the PTEs that can discard writes to the region harmlessly?
For the userspace process to make progress, the PTE needs to end up pointing to a writeable page.
If you don't want it writing to your device MMIO region, this implies you'll need to allocate a page of normal memory for the write to go to, just like the fault handler does for an anonymous VMA.
Alternatively, you could let your userspace task take a SIGBUS when this revocation event occurs, and just specify that a task using this device should expect this to happen and must install a SIGBUS handler that uses longjmp() to cancel its attempt to write to the device. The downside of this approach - apart from the additional complexity it dumps onto userspace - is that it makes using your device difficult from a library, as signal handlers are process-global state.

Making a virtual IOPCIDevice with IOKit

I have managed to create a virtual IOPCIDevice which attaches to IOResources and basically does nothing. I'm able to get existing drivers to register and match to it.
However when it comes to IO handling, I have some trouble. IO access by functions (e.g. configRead, ioRead, configWrite, ioWrite) that are described in IOPCIDevice class can be handled by my own code. But drivers that use memory mapping and IODMACommand are the problem.
There seems to be two things that I need to manage: IODeviceMemory(described in the IOPCIDevice) and DMA transfer.
How could I create a IODeviceMemory that ultimately points to memory/RAM, so that when driver tries to communicate to PCI device, it ultimately does nothing or just moves the data to RAM, so my userspace client can handle this data and act as an emulated PCI device?
And then could DMA commands be directed also to my userspace client without interfering to existing drivers' source code that use IODMACommand.
Thanks!
Trapping memory accesses
So in theory, to achieve what you want, you would need to allocate a memory region, set its protection bits to read-only (or possibly neither read nor write if a read in the device you're simulating has side effects), and then trap any writes into your own handler function where you'd then simulate device register writes.
As far as I'm aware, you can do this sort of thing in macOS userspace, using Mach exception handling. You'd need to set things up that page protection fault exceptions from the process you're controlling get sent to a Mach port you control. In that port's message handler, you'd:
check where the access was going to
if it's the device memory, you'd suspend all the threads of the process
switch the thread where the write is coming from to single-step, temporarily allow writes to the memory region
resume the writer thread
trap the single-step message. Your "device memory" now contains the written value.
Perform your "device's" side effects.
Turn off single-step in the writer thread.
Resume all threads.
As I said, I believe this can be done in user space processes. It's not easy, and you can cobble together the Mach calls you need to use from various obscure examples across the web. I got something similar working once, but can't seem to find that code anymore, sorry.
… in the kernel
Now, the other problem is you're trying to do this in the kernel. I'm not aware of any public KPIs that let you do anything like what I've described above. You could start looking for hacks in the following places:
You can quite easily make IOMemoryDescriptors backed by system memory. Don't worry about the IODeviceMemory terminology: these are just IOMemoryDescriptor objects; the IODeviceMemory class is a lie. Trapping accesses is another matter entirely. In principle, you can find out what virtual memory mappings of a particular MD exist using the "reference" flag to the createMappingInTask() function, and then call the redirect() method on the returned IOMemoryMap with a NULL backing memory argument. Unfortunately, this will merely suspend any thread attempting to access the mapping. You don't get a callback when this happens.
You could dig into the guts of the Mach VM memory subsystem, which mostly lives in the osfmk/vm/ directory of the xnu source. Perhaps there's a way to set custom fault handlers for a VM region there. You're probably going to have to get dirty with private kernel APIs though.
Why?
Finally, why are you trying to do this? Take a step back: What is it you're ultimately trying to do with this? It doesn't seem like simulating a PCI device in this way is an end to itself, so is this really the only way to do what greater goal you're ultimately trying to achieve? See: XY problem

How linux drive many network cards with the same driver?

I am learning linux network driver recently, and I wonder that if I have many network cards in same type on my board, how does the kernel drive them? Does the kernel need to load the same driver many times? I think it's not possible, insmod won't do that, so how can I make all same kind cards work at same time?
regards
The state of every card (I/O addresses, IRQs, ...) is stored into a driver-specific structure that is passed (directly or indirectly) to every entry point of the driver which can this way differenciate the cards. That way the very same code can control different cards (which means that yes, the kernel only keeps one instance of a driver's module no matter the number of devices it controls).
For instance, have a look at drivers/video/backlight/platform_lcd.c, which is a very simple LCD power driver. It contains a structure called platform_lcd that is private to this file and stores the state of the LCD (whether it is powered, and whether it is suspended). One instance of this structure is allocated in the probe function of the driver through kzalloc - that is, one per LCD device - and stored into the platform device representing the LCD using platform_set_drvdata. The instance that has been allocated for this device is then fetched back at the beginning of all other driver functions so that it knows which instance it is working on:
struct platform_lcd *plcd = to_our_lcd(lcd);
to_our_lcd expands to lcd_get_data which itself expands to dev_get_drvdata (a counterpart of platform_set_drvdata) if you look at include/linux/lcd.h. The function can then know the state of the device is has been invoked for.
This is a very simple example, and the platform_lcd driver does not directly control any device (this is deferred to a function pointer in the platform data), but add hardware-specific parameters (IRQ, I/O base, etc.) and you get how 99% of the drivers in Linux work.
The driver code is only loaded once, but it allocates a separate context structure for each card. Typically you will see a struct pci_driver with a .probe function pointer. The probe function is called once for each card by the PCI support code, and it calls alloc_etherdev to allocate a network interface with space for whatever private context it needs.

Resources