How kernel knows the presence of netlink in userspace? - linux-kernel

I am working on netlink sockets, written code for application and kernel module. Kernel module will send notifications regularly to user space application. If application gets killed, kernel module doesn't stop sending notification. How the kernel will know when application gets killed? Can we user bind and unbind in netlink_kernel_cfg for this purpose? I have searched a lot but didn't find any information on this.

When any application is killed in Linux, all file descriptors (including socket descriptors) are closed. In order to have your kernel module notified when the application closes the netlink socket you need to implement the optional .unbind op in struct netlink_kernel_cfg.
include/linux/netlink.h
/* optional Netlink kernel configuration parameters */
struct netlink_kernel_cfg {
unsigned int groups;
unsigned int flags;
void (*input)(struct sk_buff *skb);
struct mutex *cb_mutex;
int (*bind)(struct net *net, int group);
void (*unbind)(struct net *net, int group);
bool (*compare)(struct net *net, struct sock *sk);
};
Set the configuration parameters in your module:
struct netlink_kernel_cfg cfg = {
.unbind = my_unbind,
};
netlink = netlink_kernel_create(&my_netlink, ... , &cfg);
To understand how this is used, note the following fragment from the netlink protocol family proto_ops definition:
static const struct proto_ops netlink_ops = {
.family = PF_NETLINK,
.owner = THIS_MODULE,
.release = netlink_release,
.release is called upon close of the socket (in your case, due to the application being killed).
As a part of its cleanup process, netlink_release() has the following implementation:
net/netlink/af_netlink.c
if (nlk->netlink_unbind) {
int i;
for (i = 0; i < nlk->ngroups; i++)
if (test_bit(i, nlk->groups))
nlk->netlink_unbind(sock_net(sk), i + 1);
Here you can see that if the optional netlink_unbind has been provided, then it will be executed, providing you a callback when your application has closed the socket (either gracefully or as a result of being killed).

Related

What is the purpose of the function "blk_rq_map_user" in the NVME disk driver?

I am trying to understand the nvme linux drivers. I am now tackling the function nvme_user_submit_cmd, which I report partially here:
static int nvme_submit_user_cmd(struct request_queue *q,
struct nvme_command *cmd, void __user *ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
u32 meta_seed, u32 *result, unsigned timeout)
{
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
struct gendisk *disk = ns ? ns->disk : NULL;
struct request *req;
struct bio *bio = NULL;
void *meta = NULL;
int ret;
req = nvme_alloc_request(q, cmd, 0, NVME_QID_ANY);
[...]
if (ubuffer && bufflen) {
ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
GFP_KERNEL);
[...]
The ubufferis a pointer to some data in the virtual address space (since this comes from an ioctl command from a user application).
Following blk_rq_map_user I was expecting some sort of mmap mechanism to translate the userspace address into a physical address, but I can't wrap my head around what the function is doing. For reference here's the call chain:
blk_rq_map_user -> import_single_range -> blk_rq_map_user_iov
Following those function just created some more confusion for me and I'd like some help.
The reason I think that this function is doing a sort of mmap is (apart from the name) that this address will be part of the struct request in the struct request queue, which will eventually be processed by the NVME disk driver (https://lwn.net/Articles/738449/) and my guess is that the disk wants the physical address when fulfilling the requests.
However I don't understand how is this mapping done.
ubuffer is a user virtual address, which means it can only be used in the context of a user process, which it is when submit is called. To use that buffer after this call ends, it has to be mapped to one or more physical addresses for the bios/bvecs. The unmap call frees the mapping after the I/O completes. If the device can't directly address the user buffer due to hardware constraints then a bounce buffer will be mapped and a copy of the data will be made.
Edit: note that unless a copy is needed, there is no kernel virtual address mapped to the buffer because the kernel never needs to touch the data.

Linux Char Driver

Can anyone tell me how a Char Driver is bind to the corresponding physical device?
Also, I would like to know where inside a char driver we are specifying the physical device related information, which can be used by kernel to do the binding.
Thanks !!
A global array — bdev_map for block and cdev_map for character devices — is used to implement a hash table, which employs the device major number as hash key.
while registering for char driver following calls get in invoked to get major and minor numbers.
int register_chrdev_region(dev_t from, unsigned count, const char *name)
int alloc_chrdev_region(dev_t *dev, unsigned baseminor, unsigned count,
const char *name);
After a device number range has been obtained, the device needs to be activated by adding it to the character device database.
void cdev_init(struct cdev *cdev, const struct file_operations *fops);
int cdev_add(struct cdev *p, dev_t dev, unsigned count);
Here on cdev structure initialize with file operation and respected character device.
Whenever a device file is opened, the various filesystem implementations invoke the init_special_inode function to create the inode for a block or character device file.
void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
inode->i_mode = mode;
if (S_ISCHR(mode)) {
inode->i_fop = &def_chr_fops;
inode->i_rdev = rdev;
} else if (S_ISBLK(mode)) {
inode->i_fop = &def_blk_fops;
inode->i_rdev = rdev;
}
else
printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o)\n",
mode);
}
now the default_chr_fpos chrdev_open() method will get invoked. which will look up for the inode->rdev device in cdev_map array and will get a instance of cdev structure. with the reference to cdev it will bind the file->f_op to cdev file operation and invoke the open method for character driver.
In a character driver like I2C client driver, We specify the slave address in the client structure's "addr" field and then call i2c_master_send() or i2c_master_receive() on this client . This calls will ultimately go to the main adapter controlling that line and the adapter then communicates with the device specified by the slave address.
And the binding of drivers operations is done mainly with cdev_init() and cdev_add() functions.
Also driver may choose to provide probe() function and let kernel find and bind all the devices which this driver is capable of supporting.

How to send signal from kernel to user space

My kernel module code needs to send signal [def.] to a user land program, to transfer its execution to registered signal handler.
I know how to send signal between two user land processes, but I can not find any example online regarding the said task.
To be specific, my intended task might require an interface like below (once error != 1, code line int a=10 should not be executed):
void __init m_start(){
...
if(error){
send_signal_to_userland_process(SIGILL)
}
int a = 10;
...
}
module_init(m_start())
An example I used in the past to send signal to user space from hardware interrupt in kernel space. That was just as follows:
KERNEL SPACE
#include <asm/siginfo.h> //siginfo
#include <linux/rcupdate.h> //rcu_read_lock
#include <linux/sched.h> //find_task_by_pid_type
static int pid; // Stores application PID in user space
#define SIG_TEST 44
Some "includes" and definitions are needed. Basically, you need the PID of the application in user space.
struct siginfo info;
struct task_struct *t;
memset(&info, 0, sizeof(struct siginfo));
info.si_signo = SIG_TEST;
// This is bit of a trickery: SI_QUEUE is normally used by sigqueue from user space, and kernel space should use SI_KERNEL.
// But if SI_KERNEL is used the real_time data is not delivered to the user space signal handler function. */
info.si_code = SI_QUEUE;
// real time signals may have 32 bits of data.
info.si_int = 1234; // Any value you want to send
rcu_read_lock();
// find the task with that pid
t = pid_task(find_pid_ns(pid, &init_pid_ns), PIDTYPE_PID);
if (t != NULL) {
rcu_read_unlock();
if (send_sig_info(SIG_TEST, &info, t) < 0) // send signal
printk("send_sig_info error\n");
} else {
printk("pid_task error\n");
rcu_read_unlock();
//return -ENODEV;
}
The previous code prepare the signal structure and send it. Bear in mind that you need the application's PID. In my case the application from user space send its PID through ioctl driver procedure:
static long dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
ioctl_arg_t args;
switch (cmd) {
case IOCTL_SET_VARIABLES:
if (copy_from_user(&args, (ioctl_arg_t *)arg, sizeof(ioctl_arg_t))) return -EACCES;
pid = args.pid;
break;
USER SPACE
Define and implement the callback function:
#define SIG_TEST 44
void signalFunction(int n, siginfo_t *info, void *unused) {
printf("received value %d\n", info->si_int);
}
In main procedure:
int fd = open("/dev/YourModule", O_RDWR);
if (fd < 0) return -1;
args.pid = getpid();
ioctl(fd, IOCTL_SET_VARIABLES, &args); // send the our PID as argument
struct sigaction sig;
sig.sa_sigaction = signalFunction; // Callback function
sig.sa_flags = SA_SIGINFO;
sigaction(SIG_TEST, &sig, NULL);
I hope it helps, despite the fact the answer is a bit long, but it is easy to understand.
You can use, e.g., kill_pid(declared in <linux/sched.h>) for send signal to the specified process. To form parameters to it, see implementation of sys_kill (defined as SYSCALL_DEFINE2(kill) in kernel/signal.c).
Note, that it is almost useless to send signal from the kernel to the current process: kernel code should return before user-space program ever sees signal fired.
Your interface is violating the spirit of Linux. Don't do that..... A system call (in particular those related to your driver) should only fail with errno (see syscalls(2)...); consider eventfd(2) or netlink(7) for such asynchronous kernel <-> userland communications (and expect user code to be able to poll(2) them).
A kernel module could fail to be loaded. I'm not familiar with the details (never coded any kernel modules) but this hello2.c example suggests that the module init function can return a non zero error code on failure.
People are really expecting that signals (which is a difficult and painful concept) are behaving as documented in signal(7) and what you want to do does not fit in that picture. So a well behaved kernel module should never asynchronously send any signal to processes.
If your kernel module is not behaving nicely your users would be pissed off and won't use it.
If you want to fork your experimental kernel (e.g. for research purposes), don't expect it to be used a lot; only then could you realistically break signal behavior like you intend to do, and you could code things which don't fit into the kernel module picture (e.g. add a new syscall). See also kernelnewbies.

Is a spinlock necessary in this Linux device driver code?

Is the following Linux device driver code safe, or do I need to protect access to interrupt_flag with a spinlock?
static DECLARE_WAIT_QUEUE_HEAD(wq_head);
static int interrupt_flag = 0;
static ssize_t my_write(struct file* filp, const char* __user buffer, size_t length, loff_t* offset)
{
interrupt_flag = 0;
wait_event_interruptible(wq_head, interrupt_flag != 0);
}
static irqreturn_t handler(int irq, void* dev_id)
{
interrupt_flag = 1;
wake_up_interruptible(&wq_head);
return IRQ_HANDLED;
}
Basically, I kick off some event in my_write() and wait for the interrupt to indicate that it completes.
If so, which form of spin_lock() do I need to use? I thought spin_lock_irq() was appropriate, but when I tried that I got a warning about the IRQ handler enabling interrupts.
Doesn't wait_event_interruptible evaluate the interrupt_flag != 0 condition? That would imply that the lock should be held while it reads the flag, right?
No lock is needed in the example given. Memory barriers are needed after the store of the flag, and before the load -- to ensure visibility to the flag -- but the wait_event_* and wake_up_* functions provide those. See the section entitled "Sleep and wake-up functions" in this document: https://www.kernel.org/doc/Documentation/memory-barriers.txt
Before adding a lock, consider what is being protected. Generally locks are needed if you're setting two or more separate pieces of data and you need to ensure that another cpu/core doesn't see an incomplete intermediate state (after you started but before you finished). In this case, there's no point in protecting the storing / loading of the flag value because stores and loads of a properly aligned integer are always atomic.
So, depending on what else your driver is doing, it's quite possible you do need a lock, but it isn't needed for the snippet you've provided.
Yes you need a lock. With the given example (that uses int and no specific arch is mentioned), the process context may be interrupted while accessing the interrupt_flag. Upon return from the IRQ, it may continue and interrupt_flag may be left in inconsistent state.
Try this:
static DECLARE_WAIT_QUEUE_HEAD(wq_head);
static int interrupt_flag = 0;
DEFINE_SPINLOCK(lock);
static ssize_t my_write(struct file* filp, const char* __user buffer, size_t length, loff_t* offset)
{
/* spin_lock_irq() or spin_lock_irqsave() is OK here */
spin_lock_irq(&lock);
interrupt_flag = 0;
spin_unlock_irq(&lock);
wait_event_interruptible(wq_head, interrupt_flag != 0);
}
static irqreturn_t handler(int irq, void* dev_id)
{
unsigned long flags;
spin_lock_irqsave(&lock, flags);
interrupt_flag = 1;
spin_unlock_irqrestore(&lock, flags);
wake_up_interruptible(&wq_head);
return IRQ_HANDLED;
}
IMHO, the code has to be written without making any arch or compiler-related assumptions (like the 'properly aligned integer' in Gil Hamilton answer).
Now if we can change the code and use atomic_t instead of the int flag, then no locks should be needed.

writing data to debugfs --- from a device driver

With proc we can easily use read & write system call as shown in this example.
write on /proc entry through user space
But i am working on passing information from driver to user-space using debugfs.
I am able to find these two example code. Here application is able to read and write to debugfs file using mmap() system call.
http://people.ee.ethz.ch/~arkeller/linux/code/mmap_simple_kernel.c
http://people.ee.ethz.ch/~arkeller/linux/code/mmap_user.c
But suppose in my case requirement for communicating using Debugfs file with device driver:
user-space application <-------> debugfs file <-------> Device driver
So can i use same code mmap_simple_kernel.c inside my --->> device driver code --->> and transfer data to debugfs directly from driver ? But in this case there will be two file_operations structures inside my driver will it cause some problem ? Is it right approach ?
Or just like application is following process in -- mmap_user.c --- same process -- i follow in my device driver program. And keep mmap_simple_kernel.c as seprate module for debugfs entry ?
You can take a look at how kmemleak uses debugfs in mm/kmemleak.c:
static const struct seq_operations kmemleak_seq_ops = {
.start = kmemleak_seq_start,
.next = kmemleak_seq_next,
.stop = kmemleak_seq_stop,
.show = kmemleak_seq_show,
};
static int kmemleak_open(struct inode *inode, struct file *file)
{
return seq_open(file, &kmemleak_seq_ops);
}
static int kmemleak_release(struct inode *inode, struct file *file)
{
return seq_release(inode, file);
}
static ssize_t kmemleak_write(struct file *file, const char __user *user_buf,
size_t size, loff_t *ppos)
{...}
static const struct file_operations kmemleak_fops = {
.owner = THIS_MODULE,
.open = kmemleak_open,
.read = seq_read,
.write = kmemleak_write,
.llseek = seq_lseek,
.release = kmemleak_release,
};
dentry = debugfs_create_file("kmemleak", S_IRUGO, NULL, NULL,
&kmemleak_fops);
This question is the top search result in Google for mmap debugfs. I am adding here an important information. According to this https://lkml.org/lkml/2016/5/21/73 post debugfs_create_file() in the kernel 4.8.0 or higher will ignore .mmap field in the struct file_operations
Use debugfs_create_file_unsafe() as a workaround

Resources