Linux Device Driver Program, where the program starts? - linux-kernel

I've started to learn Linux driver programs, but I'm finding it a little difficult.
I've been studying the i2c driver, and I got quite confused regarding the entry point of the driver program. Does the driver program start at the module_init() macro?
I'd also like to know how I can follow the process of how the driver program runs. I got the book, Linux Device Drivers, but I'm still quite confused. Could you help me? Thanks a lot.
I'll take the i2c driver as an example. There are so many functions in it; how can I work out how the functions in the i2c drivers relate to and call one another?

A device driver is not a "program" that has a main() with a start point and an exit point. It's more like an API, a library, or a collection of routines. In this case, it's a set of entry points declared by module_init(), module_exit(), perhaps EXPORT_SYMBOL(), and structures that list entry points for operations.
For block devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/blkdev.h):
struct block_device_operations {
int (*open) ();
int (*release) ();
int (*ioctl) ();
int (*compat_ioctl) ();
int (*direct_access) ();
unsigned int (*check_events) ();
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
int (*media_changed) ();
void (*unlock_native_capacity) ();
int (*revalidate_disk) ();
int (*getgeo)();
/* this callback is with swap_lock and sometimes page table lock held */
void (*swap_slot_free_notify) ();
struct module *owner;
};
For char devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/fs.h):
struct file_operations {
struct module *owner;
loff_t (*llseek) ();
ssize_t (*read) ();
ssize_t (*write) ();
ssize_t (*aio_read) ();
ssize_t (*aio_write) ();
int (*readdir) ();
unsigned int (*poll) ();
long (*unlocked_ioctl) ();
long (*compat_ioctl) ();
int (*mmap) ();
int (*open) ();
int (*flush) ();
int (*release) ();
int (*fsync) ();
int (*aio_fsync) ();
int (*fasync) ();
int (*lock) ();
ssize_t (*sendpage) ();
unsigned long (*get_unmapped_area)();
int (*check_flags)();
int (*flock) ();
ssize_t (*splice_write)();
ssize_t (*splice_read)();
int (*setlease)();
long (*fallocate)();
};
For platform devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/platform_device.h):
struct platform_driver {
int (*probe)();
int (*remove)();
void (*shutdown)();
int (*suspend)();
int (*resume)();
struct device_driver driver;
const struct platform_device_id *id_table;
};
A driver, especially a char driver, does not have to support every operation listed. Note that C designated initializers (e.g. .read = my_read) make these structures easier to fill in, because each entry is set by name and the entries you leave out default to NULL.
Does the driver program start at the module_init() macro?
The driver's init() routine specified in module_init() will be called during boot (when statically linked in) or when the module is dynamically loaded. The driver passes its structure of operations to the device's subsystem when it registers itself during its init().
These device driver entry points, e.g. open() or read(), are typically executed when the user app invokes a C library call (in user space) and after a switch to kernel space. Note that the i2c driver you're looking at is a platform driver for a bus that is used by leaf devices, and its functions exposed by EXPORT_SYMBOL() would be called by other drivers.
Only the driver's init() routine specified in module_init() is guaranteed to be called. The driver's exit() routine specified in module_exit() would only be executed if/when the module is dynamically unloaded. The driver's op routines will be called asynchronously (just like its interrupt service routine) in unknown order. Hopefully user programs will invoke an open() before issuing a read() or an ioctl() operation, and invoke other operations in a sensible fashion. A well-written and robust driver should accommodate any order or sequence of operations, and produce sane results to ensure system integrity.
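The registration-then-callback flow described above can be imitated in user space. The following is only a sketch: demo_ops, demo_register(), demo_sys_open() and demo_sys_read() are hypothetical names invented for illustration, not kernel APIs, and the error codes are chosen to resemble what the VFS does when a hook is missing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified stand-in for struct file_operations. */
struct demo_ops {
    int  (*open)(void);
    long (*read)(char *buf, size_t len);
    int  (*release)(void);
};

/* The "subsystem" remembers the table the driver handed over. */
static const struct demo_ops *registered_ops;

/* Called by the driver from its init routine. */
int demo_register(const struct demo_ops *ops)
{
    if (!ops)
        return -EINVAL;
    registered_ops = ops;
    return 0;
}

/* What the "kernel" side does when user space calls open():
 * a missing open hook simply means there is nothing extra to do. */
int demo_sys_open(void)
{
    if (!registered_ops)
        return -ENODEV;
    return registered_ops->open ? registered_ops->open() : 0;
}

/* What the "kernel" side does when user space calls read():
 * a missing read hook is reported as an error. */
long demo_sys_read(char *buf, size_t len)
{
    if (!registered_ops)
        return -ENODEV;
    if (!registered_ops->read)
        return -EINVAL;
    return registered_ops->read(buf, len);
}

/* ---- the "driver": it implements read only ---- */
static long my_read(char *buf, size_t len)
{
    static const char msg[] = "hello";
    size_t n = len < sizeof(msg) - 1 ? len : sizeof(msg) - 1;

    assert(buf != NULL);
    memcpy(buf, msg, n);
    return (long)n;
}

static const struct demo_ops my_ops = {
    .read = my_read,   /* open and release stay NULL */
};

/* Plays the role of the function named in module_init(). */
int my_driver_init(void)
{
    return demo_register(&my_ops);
}
```

Nothing in the driver part runs until my_driver_init() has been called; after that, my_read() runs only when a caller goes through demo_sys_read(), which mirrors how a real driver sits idle between requests.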

It would probably help to stop thinking of a device driver as a program. They're completely different. A program has a specific starting point, does some stuff, and has one or more fairly well defined (well, they should be, anyway) exit points. Drivers have some stuff to do when they first get loaded (e.g. module_init() and other stuff), and may or may not ever do anything ever again (you can forcibly load a driver for hardware your system doesn't actually have), and may have some stuff that needs to be done if the driver is ever unloaded. Aside from that, a driver generally provides some specific entry points (system calls, ioctls, etc.) that user-land applications can access to request the driver to do something.
Horrible analogy, but think of a program kind of like a car - you get in, start it up, drive somewhere, and get out. A driver is more like a vending machine - you plug it in and make sure it's stocked, but then people just come along occasionally and push buttons to make it do something.

You are actually talking about an (I2C) platform (native) driver, so first you need to understand how the module_init() of a platform driver gets called compared with other loadable modules.
/**
 * module_init() - driver initialization entry point
 * @x: function to be run at kernel boot time or module insertion
 *
 * module_init() will either be called during do_initcalls() (if
 * builtin) or at module insertion time (if a module). There can only
 * be one per module.
 */
For the i2c driver, you can refer to http://www.linuxjournal.com/article/7136 and
http://www.embedded-bits.co.uk/2009/i2c-in-the-2632-linux-kernel/

A kernel module begins at its initialization function, which is usually marked with the __init macro just in front of the function name.
The __init macro tells the Linux kernel that the function is an initialization function and that the memory it occupies can be freed once the function has finished executing.
There are other macros, module_init() and module_exit(), that identify the initialization and release functions [as described above].
These two macros are used when the device driver is meant to operate as a loadable and removable kernel module at run time [i.e. using the insmod or rmmod command].

In short: as soon as you run insmod, the module's init function runs. It registers the driver with the driver subsystem, and the subsystem then calls the driver's .probe for each matching device.
Every time driver functionality is requested from the user application, the driver's functions are invoked through the registered callbacks.

"Linux Device Drivers" is a good book, but it's old!
Basic example:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Name and e-mail");
MODULE_DESCRIPTION("my_first_driver");
/* Runs once, when the module is inserted (or at boot if built in). */
static int __init insert_mod(void)
{
	printk(KERN_INFO "Module constructor\n");
	return 0;
}
/* Runs when the module is removed. */
static void __exit remove_mod(void)
{
	printk(KERN_INFO "Module destructor\n");
}
module_init(insert_mod);
module_exit(remove_mod);
An up-to-date tutorial, really well written, is "Linux Device Drivers Series"

Related

How can I check whether a kernel address belongs to the Linux kernel executable, and not just the core kernel text?

Coming from the Windows world, I assume that Vmlinuz is equivalent to ntoskrnl.exe, and this is the kernel executable that gets mapped in Kernel memory.
Now I want to figure out whether an address inside kernel belongs to the kernel executable or not. Is using core_kernel_text the correct way of finding this out?
Because core_kernel_text doesn't return true for some of the addresses that clearly should belong to Linux kernel executable.
For example, core_kernel_text doesn't return true when I give it the syscall entry handler address, which can be found with the following code:
unsigned long system_call_entry;
rdmsrl(MSR_LSTAR, system_call_entry);
return (void *)system_call_entry;
And when I use this code snippet, the address of the syscall entry handler doesn't belong to the core kernel text or to any kernel module (using get_module_from_addr).
So how can the address of a handler that clearly belongs to the Linux kernel executable, such as the syscall entry, belong to neither the core kernel nor any kernel module? What does it belong to, then?
Which API do I need to use for these types of addresses, which clearly belong to the Linux kernel executable, to assure me that the address indeed belongs to the kernel?
I need such an API because I need to write a detection for malicious kernel modules that patch such handlers, and for now I need to make sure the address belongs to kernel, and not some third party kernel module or random kernel address. (Please do not discuss methods that can be used to bypass my detection, obviously it can be bypassed but that's another story)
The target kernel version is 4.15.0-112-generic, and is Ubuntu 16.04 as a VMware guest.
Reproducible code as requested:
typedef int(*core_kernel_text_t)(unsigned long addr);
core_kernel_text_t core_kernel_text_;
core_kernel_text_ = (core_kernel_text_t)kallsyms_lookup_name("core_kernel_text");
unsigned long system_call_entry;
rdmsrl(MSR_LSTAR, system_call_entry);
int isInsideCoreKernel = core_kernel_text_((unsigned long)system_call_entry);
printk("%d , 0x%pK ", isInsideCoreKernel, (void *)system_call_entry);
EDIT1: So in the MSR_LSTAR example that I gave above, it turns out that It's related to Kernel Page Table Isolation and CONFIG_RETPOLINE=y in config:
system_call value is different each time when I use rdmsrl(MSR_LSTAR, system_call)
And that's why I am getting the address 0xfffffe0000006000 aka SYSCALL64_entry_trampoline, the same as the question above.
So now the question remains, why this SYSCALL64_entry_trampoline address doesn't belong to anything? It doesn't belong to any kernel module, and it doesn't belong to the core kernel, so which executable this address belongs to and how can I check that with an API similar to core_kernel_text? It seems like it belongs to cpu_entry_area, but what is that and how can I check if an address belongs to that?
You are seeing this "weird" address in MSR_LSTAR (IA32_LSTAR) because of Kernel Page-Table Isolation (KPTI), which mitigates Meltdown. As other existing answers(1) you already found point out, the address you see is the one of a small trampoline (entry_SYSCALL_64_trampoline) that is dynamically remapped at boot time by the kernel for each CPU, and thus does not have an address within the kernel text.
(1)By the way, the answer linked above wrongly states that the corresponding config option for KPTI is CONFIG_RETPOLINE=y. This is wrong, the "retpoline" is a mitigation for Spectre, not Meltdown. The config to enable KPTI is CONFIG_PAGE_TABLE_ISOLATION=y.
You don't have many options. Either:
Tell VMware to emulate a recent CPU that is not vulnerable to Meltdown.
Detect and implement support for the KPTI trampoline.
You can implement support for this by detecting whether the kernel supports KPTI (CONFIG_PAGE_TABLE_ISOLATION), and if so, checking whether the current CPU has KPTI enabled. The code in arch/x86/kernel/cpu/bugs.c that provides information for /sys/devices/system/cpu/vulnerabilities/meltdown shows how this can be detected:
ssize_t cpu_show_meltdown(struct device *dev,
struct device_attribute *attr, char *buf)
{
if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
return sprintf(buf, "Not affected\n");
if (boot_cpu_has(X86_FEATURE_PTI))
return sprintf(buf, "Mitigation: PTI\n");
return sprintf(buf, "Vulnerable\n");
}
The actual trampoline is set up at boot and its address is stored in each CPU's "entry area" for later use (e.g. here when setting up IA32_LSTAR). This answer on Unix & Linux SE explains the purpose of the cpu entry area and its relation to KPTI.
In your module you can do the following detection:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kallsyms.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/cpufeature.h>
#include <asm/cpu_entry_area.h>
// ...
typedef int(*core_kernel_text_t)(unsigned long addr);
core_kernel_text_t core_kernel_text_;
bool syscall_entry_64_ok(void)
{
unsigned long entry;
rdmsrl(MSR_LSTAR, entry);
if (core_kernel_text_(entry))
return true;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (this_cpu_has(X86_FEATURE_PTI)) {
int cpu = smp_processor_id();
unsigned long trampoline = (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline;
if ((entry & PAGE_MASK) == trampoline)
return true;
}
#endif
return false;
}
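static int __init modinit(void)
{
core_kernel_text_ = (core_kernel_text_t)kallsyms_lookup_name("core_kernel_text");
if (!core_kernel_text_)
return -EOPNOTSUPP;
pr_info("syscall_entry_64_ok() -> %d\n", syscall_entry_64_ok());
return 0;
}
module_init(modinit);
/* kallsyms_lookup_name() is a GPL-only export, so a GPL-compatible license is required. */
MODULE_LICENSE("GPL");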
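The page comparison used in syscall_entry_64_ok() can be tried out in user space. This is a minimal sketch that assumes 4 KiB pages (as on x86-64); DEMO_PAGE_SIZE, DEMO_PAGE_MASK and demo_same_page() are hypothetical names, and the sample address is the one quoted in the question.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86-64. */
#define DEMO_PAGE_SIZE 4096ULL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/* Returns 1 if addr lies inside the page that starts at page_start. */
int demo_same_page(uint64_t addr, uint64_t page_start)
{
    /* page_start must itself be page aligned */
    assert((page_start & ~DEMO_PAGE_MASK) == 0);
    return (addr & DEMO_PAGE_MASK) == page_start;
}
```

With the trampoline page at 0xfffffe0000006000, any entry address from 0xfffffe0000006000 to 0xfffffe0000006fff is accepted, and 0xfffffe0000007000 is rejected.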

Can I query device tree items without creating a platform device?

I am writing a kernel module intended to functionally test a device driver kernel module for an ARM+FPGA SOC system. My approach involves finding which interrupts the device driver is using by querying the device tree. In the device driver itself, I register a platform driver using platform_driver_register and in the .probe function I am passed a platform_device* pointer that contains the device pointer. With this I can call of_match_device and irq_of_parse_and_map, retrieving the irq numbers.
I don't want to register a second platform driver just to query the device tree this way in the test module. Is there some other way I can query the device tree (perhaps more directly, by name perhaps?)
This is what I've found so far, and it seems to work. of_find_compatible_node does what I want. Once I have the device_node*, I can call irq_of_parse_and_map (since of_irq_get_byname doesn't seem to compile for me). I can use it something like the following:
#include <linux/of.h>
#include <linux/of_irq.h>
....
int get_dut_irq(const char *dev_compatible_name)
{
	struct device_node *dev_node;
	unsigned int irq;
	dev_node = of_find_compatible_node(NULL, NULL, dev_compatible_name);
	if (!dev_node)
		return -1;
	/* irq_of_parse_and_map() returns 0 when no mapping could be made */
	irq = irq_of_parse_and_map(dev_node, 0);
	of_node_put(dev_node);
	return irq ? (int)irq : -1;
}

What does actually cdev_add() do? in terms of registering a device to the kernel

What does cdev_add() actually do? I'm asking in terms of registering a device with the kernel.
Does it add the pointer to cdev structure in some map which is indexed by major and minor number? How exactly does this happen when you say the device is added/registered with the kernel. I want to know what steps the cdev_add takes to register the device in the running kernel. We create a node for user-space using mknod command. Even this command is mapped using major and minor number. Does registration also do something similar?
cdev_add registers a character device with the kernel. The kernel maintains a list of character devices under cdev_map:
static struct kobj_map *cdev_map;
kobj_map is basically an array of probes, which in this case is the list of character devices:
struct kobj_map {
struct probe {
struct probe *next;
dev_t dev;
unsigned long range;
struct module *owner;
kobj_probe_t *get;
int (*lock)(dev_t, void *);
void *data;
} *probes[255];
struct mutex *lock;
};
You can see that each entry in the list has the major and minor number for the device (dev_t dev) and a way to reach the device structure: data holds the pointer that was registered (the struct cdev in this case), and get is a kobj_probe_t callback that returns the corresponding kernel object. cdev_add adds your character device to the probes list:
int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
...
error = kobj_map(cdev_map, dev, count, NULL,
exact_match, exact_lock, p);
When you do an open on a device from a process, the kernel finds the inode associated with the filename of your device (via the namei function). The inode has the major and minor number for the device (dev_t i_rdev), and flags (i_mode) indicating that it is a special (character) device. With this it can access the cdev list I explained above, and get the cdev structure instantiated for your device. From there it can create a struct file with the file operations of your cdev, and install a file descriptor in the process's file descriptor table.
This is what actually 'registering' a character device means and why it needs to be done. Registering a block device is similar. The kernel maintains another list for registered gendisks.
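The lookup described above can be sketched in user space. This is a simplified model, not the kernel's kobj_map code: the names demo_probe, demo_map_add() and demo_map_lookup() are hypothetical, the caller supplies the storage for each entry, and the real implementation additionally keeps each list ordered by range, takes a mutex, and handles ranges that span several majors. The 20-bit minor split matches the kernel's MINORBITS.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int demo_dev_t;

/* Same split as the kernel: 12 bits major, 20 bits minor. */
#define DEMO_MINORBITS 20
#define DEMO_MKDEV(ma, mi) (((demo_dev_t)(ma) << DEMO_MINORBITS) | (mi))
#define DEMO_MAJOR(dev) ((demo_dev_t)(dev) >> DEMO_MINORBITS)

/* Simplified version of the probe entry shown above. */
struct demo_probe {
    struct demo_probe *next;
    demo_dev_t dev;        /* first device number covered */
    unsigned long range;   /* how many consecutive numbers */
    void *data;            /* the registered object (a cdev in the kernel) */
};

/* One bucket per (major % 255), each a singly linked list. */
static struct demo_probe *demo_probes[255];

/* Rough analogue of cdev_add(): link an entry into its major's bucket. */
void demo_map_add(struct demo_probe *p, demo_dev_t dev,
                  unsigned long range, void *data)
{
    unsigned int idx = DEMO_MAJOR(dev) % 255;

    assert(p != NULL && range > 0);
    p->dev = dev;
    p->range = range;
    p->data = data;
    p->next = demo_probes[idx];
    demo_probes[idx] = p;
}

/* Rough analogue of what open() does with inode->i_rdev:
 * walk the bucket and return the object whose range covers dev. */
void *demo_map_lookup(demo_dev_t dev)
{
    struct demo_probe *p;

    for (p = demo_probes[DEMO_MAJOR(dev) % 255]; p; p = p->next)
        if (dev >= p->dev && dev - p->dev < p->range)
            return p->data;
    return NULL;
}
```

Registering major 240 with a count of 4 makes minors 0 to 3 resolvable; a lookup for minor 4, or for a major nobody registered, returns NULL, which corresponds to open() failing because no driver claimed that device number.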
You can read Linux Device Drivers. It is a little bit old, but the main ideas are the same. It is difficult to explain even a simple operation like cdev_add(), and everything around it, in a few lines.
I suggest you read the book and the source code. If you have trouble navigating the source code, you can use a tag system like etags + emacs, or the Eclipse indexer.
Please see the code comments here:
/**
 * cdev_add() - add a char device to the system
 * @p: the cdev structure for the device
 * @dev: the first device number for which this device is responsible
 * @count: the number of consecutive minor numbers corresponding to this
 *         device
 *
 * cdev_add() adds the device represented by @p to the system, making it
 * live immediately. A negative error code is returned on failure.
 */
The immediate answer to any such question is to read the code. That's what Linus says.
[edit]
cdev_add basically adds the device to the system. What it means essentially is that after the cdev_add operation your new device will get visibility through the /sys/ file system. The function does all the necessary housekeeping related to that; in particular, the kobject reference to your device gets inserted at its position in the object hierarchy. If you want more information about it, I would suggest some reading around sysfs and struct kobject.

Which suspend/resume pointer is the right one to use?

I am working on power management on an i2c driver and noticed something odd.
include/linux/i2c.h
struct i2c_driver {
//...
int (*suspend)(struct i2c_client *, pm_message_t mesg);
int (*resume)(struct i2c_client *);
//...
struct device_driver driver;
//...
}
include/linux/device.h
struct device_driver {
//...
int (*suspend) (struct device *dev, pm_message_t state);
int (*resume) (struct device *dev);
//...
const struct dev_pm_ops *pm;
//...
}
include/linux/pm.h
struct dev_pm_ops {
//...
int (*suspend)(struct device *dev);
int (*resume)(struct device *dev);
//...
}
Why are there so many suspend and resume function pointers? Some kind of legacy thing? Which one should I use for my driver?
I am on an old kernel (2.6.35)
Thanks!
Why are there so many suspend and resume function pointers?
i2c_driver - legacy support.
device_driver - standard support.
dev_pm_ops - extended power management.
Notice that they are all function pointers. There is sequencing for the suspend and resume. For instance, the i2c controller must be suspended after the devices but resumed before.
Some kind of legacy thing?
The struct i2c_driver is a legacy mechanism. It existed before the entire power infrastructure was created. As well, some configurations may exclude the full struct dev_pm_ops pointer, but have the suspend and resume driver hooks. The full struct dev_pm_ops supports suspend to disk and other features. Suspend to memory is more common and is given pointer space in struct device_driver. If struct dev_pm_ops is non-NULL, the two pointers will be the same. These two should call the same routine in your driver.
Which one should I use for my driver?
You probably shouldn't use any of them. It is more likely that your driver is part of some other sub-system (see the Note below). For instance, i2c is used in codecs like the wm8940.c. Generally, i2c is not the central controlling sub-system. It is a driver that is used by something else to control a chip-set. The sound sub-system will be suspended before i2c and it is better to put your hooks there. If your driver is pure i2c, then use the macros in pm.h, like SET_SYSTEM_SLEEP_PM_OPS, to conditionalize the setting of dev_pm_ops; so set both of them. Most likely the device_driver will be copied to dev_pm_ops if it exists, but doing it explicitly is better.
The driver-model, i2c power management and power driver documentation have more information on the structure and overviews.
Note: In this case, there are multiple device_driver structures. Usually the i2c_driver is managed by a controlling driver. The controlling driver should implement a suspend/resume for its sub-system, which uses the i2c_driver interface.

Should open method in Linux device driver return a file descriptor?

I'm studying Linux Device Driver programming 3rd edition and I have some questions about the open method, here's the "scull_open" method used in that book:
int scull_open(struct inode *inode, struct file *filp){
struct scull_dev *dev; /* device information */
dev = container_of(inode->i_cdev, struct scull_dev, cdev);
filp->private_data = dev; /* for other methods */
/* now trim to 0 the length of the device if open was write-only */
if ( (filp->f_flags & O_ACCMODE) == O_WRONLY) {
if (down_interruptible(&dev->sem))
return -ERESTARTSYS;
scull_trim(dev); /* ignore errors */
up(&dev->sem);
}
return 0; /* success */
}
And my questions are:
Shouldn't this function return a file descriptor to the device just opened?
Isn't the "*filp" local to this function? Then why do we copy the contents of dev to it?
How could we use it later in the read and write methods?
Could someone write me a typical "non-fatty" implementation of the open method?
ssize_t scull_read(struct file *filp, char __user *buf, size_t count, loff_t *f_pos){
struct scull_dev *dev = filp->private_data;
...}
You are thinking of the user-space open() function, which is a system call that returns a file descriptor (an int). There are plenty of good references for that, such as APUE 3.3.
The device driver "open method" is a function within the file_operations structure. It is different from the user-space "file open". With the device driver installed, when user code opens the device (e.g. accessing /dev/scull0), this "open method" gets called.
Shouldn't this function return a file descriptor to the device just opened?
In a Linux device driver, open() returns either 0 or a negative error code. The file descriptor is managed internally by the kernel.
Isn't the "*filp" local to this function? Then why do we copy the contents of dev to it?
filp represents the opened file, and the kernel passes a pointer to it to the driver, so it is not local to this function. Sometimes the driver uses filp and sometimes it does not need it. Storing the dev pointer in filp->private_data is required so that when some other function, say read(), is called, the driver can retrieve its device-specific data.
How could we use it later in the read and write methods?
One of the most common ways to use filp in read()/write() is to fetch the device structure from filp->private_data and acquire its lock. Every read or write takes that same lock, which prevents the data buffer from being corrupted when multiple processes access the same device.
Could someone write me a typical "non-fatty" implementation of the open method?
As you are just studying, please enjoy exploring more. An implementation can be found here
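The way open() hands the device to later calls can be imitated in user space. This is only a sketch: every demo_ name is hypothetical, the structures are tiny stand-ins for struct inode, struct file, struct cdev and struct scull_dev, and demo_container_of() is a portable approximation of the kernel's container_of() without its type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Portable approximation of the kernel's container_of():
 * step back from a member's address to the enclosing structure. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Tiny stand-ins for the kernel structures involved. */
struct demo_cdev  { int unused; };
struct demo_inode { struct demo_cdev *i_cdev; };
struct demo_file  { void *private_data; };

/* Plays the role of struct scull_dev: the cdev is embedded in it. */
struct demo_dev {
    int size;
    struct demo_cdev cdev;
};

/* "open method": find the device, remember it in filp, return 0.
 * Note that it returns a status code, not a file descriptor. */
int demo_open(struct demo_inode *inode, struct demo_file *filp)
{
    struct demo_dev *dev;

    assert(inode != NULL && filp != NULL);
    dev = demo_container_of(inode->i_cdev, struct demo_dev, cdev);
    filp->private_data = dev;   /* for the other methods */
    return 0;
}

/* A later method receives the same filp and gets the device back. */
int demo_read_size(struct demo_file *filp)
{
    struct demo_dev *dev = filp->private_data;

    return dev->size;
}
```

Because the caller owns the demo_file object and passes the same pointer to every method, whatever demo_open() stores in private_data is still there when demo_read_size() runs, which is the same reason scull_read() can start with filp->private_data.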

Resources