using system call in Linux kernel file - linux-kernel

I am implementing a custom process scheduler in Linux. And I want to use a system call to record my program so that I can debug easily.
The file I write is
source code : linux-x.x.x/kernel/sched_new_scheduler.c
In sched_new_scheduler.c could I use syscall(the id of the system call, parameter); directly? It seems syscall(); is used with #include<sys/syscalls.h> in C program, but the ".h" can not be found in the kernel/.
I just want to know how my program executes by recording something, so could I directly write printk("something"); in sched_new_scheduler.c ? Or try a correct way to use system call?

System call look like wrapper around other kernel function one of ways how to use syscall inside kernel is find sub function for exact system call. For example:
int open(const char *pathname, int flags, mode_t mode); -> filp_open
////////////////////////////////////////////////////////////////////////////////////////////////
struct file* file_open(const char* path, int flags, int rights)
{
struct file* filp = NULL;
mm_segment_t oldfs;
int err = 0;
oldfs = get_fs();
set_fs(get_ds());
filp = filp_open(path, flags, rights);
set_fs(oldfs);
if(IS_ERR(filp)) {
err = PTR_ERR(filp);
return NULL;
}
return filp;
}
ssize_t write(int fd, const void *buf, size_t count); -> vfs_write
////////////////////////////////////////////////////////////////////////////////////////////////
int file_write(struct file* file, unsigned long long offset, unsigned char* data, unsigned int size)
{
mm_segment_t oldfs;
int ret;
oldfs = get_fs();
set_fs(get_ds());
ret = vfs_write(file, data, size, &offset);
set_fs(oldfs);
return ret;
}

A system call is supposed to be used by an application program to avail a service from kernel. You can implement a system call in your kernel module, but that should be called from an application program. If you just want to expose the statistics of your new scheduler to the userspace for debugging, you can use interfaces like proc, sys, debugfs etc. And that would be much more easier than implementing a system call and writing a userspace application to use it.

Related

How to trigger fops poll function from the kernel driver

I am working on a kernel driver which logs some spi data in a virtual file using debugfs.
My main goal is to be able to "listen" for incomming data from userspace using for example $ tail -f /sys/kernel/debug/spi-logs which is using select to wait for new data on the debugfs file.
I've implemented the fops poll function in the driver and when I am trying to get the data from the userspace, the poll function is never called even though there is new data available in the kernel to be read.
I assume that the poll function never gets called because the debugfs file never gets actually written.
My question is, is there a way to trigger the poll function from the kernel space when new data is available?
EDIT: Added an example
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/debugfs.h>
#include <linux/wait.h>
#include <linux/poll.h>
struct module_ctx {
struct wait_queue_head wq;
};
struct module_ctx module_ctx;
static ssize_t debugfs_read(struct file *filp, char __user *buff, size_t count, loff_t *off)
{
// simulate no data left to read for now
return 0;
}
static __poll_t debugfs_poll(struct file *filp, struct poll_table_struct *wait) {
struct module_ctx *module_hdl;
__poll_t mask = 0;
module_hdl = filp->f_path.dentry->d_inode->i_private;
pr_info("CALLED!!!");
poll_wait(filp, &module_hdl->wq, wait);
if (is_data_available_from_an_external_ring_buffer())
mask |= POLLIN | POLLRDNORM;
return mask;
}
loff_t debugfs_llseek(struct file *filp, loff_t offset, int orig)
{
loff_t pos = filp->f_pos;
switch (orig) {
case SEEK_SET:
pos = offset;
break;
case SEEK_CUR:
pos += offset;
break;
case SEEK_END:
pos = 0; /* Going to the end => to the beginning */
break;
default:
return -EINVAL;
}
filp->f_pos = pos;
return pos;
}
static const struct file_operations debugfs_fops = {
.owner = THIS_MODULE,
.read = debugfs_read,
.poll = debugfs_poll,
.llseek = debugfs_llseek,
};
static int __init rb_example_init(void)
{
struct dentry *file;
init_waitqueue_head(&module_ctx.wq);
file = debugfs_create_file("spi_logs", 0666, NULL, &module_ctx,
&debugfs_fops);
if (!file) {
pr_err("qm35: failed to create /sys/kernel/debug/spi_logs\n");
return 1;
}
return 0;
}
static void __exit
rb_example_exit(void) {
}
module_init(rb_example_init);
module_exit(rb_example_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mihai Pop");
MODULE_DESCRIPTION("A simple example Linux module.");
MODULE_VERSION("0.01");
Using tail -f /sys/kernel/debug/spi_logs, the poll function never gets called
Semantic of poll is to return whenever encoded operations (read and/or write) on a file would return without block. In case of read operation, "block" means:
If read is called in nonblocking mode (field f_flags of the struct file has flag O_NONBLOCK set), then it returns -EAGAIN.
If read is called in blocking mode, then it puts a thread into the waiting state.
As you can see, your read function doesn't follow that convention and returns 0, which means EOF. So the caller has no reason to call poll after that.
Semantic of -f option for tail:
... not stop when end of file is reached, but rather to wait ...
is about the situation, when read returns 0, but the program needs to wait.
As you can see, poll semantic is not suitable for such wait. Instead, such programs use inotify mechanism.

Cannot printk user space string parameter when intercepting a syscall

I'm trying to intercept a Linux syscall to record all opened filename to a log file. but there's a problem: it failed to printk the filename in user space. Here are the codes of fake syscall function:
static inline long hacked_open(const char __user *filename, int flags, umode_t mode)
{
char buf[256];
buf[255] = '\0';
long res = strncpy_from_user(buf, filename, 255);
if (res > 0)
printk("%s\n", buf);
else
printk("---err len : %ld ---\n", res);
orig_func a = (orig_func)orig_open;
return a(filename, flags, mode);
}
after I loaded the kernel module, dmesg showed a lot of message as:
---err len : -14---
I've tried copy_from_user and printk the filename directly, but they all doesn't work.
I've solved this problem by myself.
the parameters of hacked_open are wrong.
the correct hacked_openat should be :
asmlinkage long hacked_openat(struct pt_regs *regs)
and we can get filename from user-space like this:
int nRet = strncpy_from_user(filename, (char __user *)regs->si, 1024);

kernel tried to execute NX-protected page - exploit attempt?

Look at this very basic linux module:
static ssize_t checksec_write(struct file *f, const char __user *buf,size_t len, loff_t *off)
{
unsigned long addr_fonction_userspace;
memcpy(&addr_fonction_userspace,buf,sizeof(unsigned long));
void (*functionPtr)(void);
functionPtr = addr_fonction_userspace;
(*functionPtr)();
return len;
}
This module works this /dev/checksec char device.
and look at this basic c userland program:
void fonction_userland();
void fonction_userland()
{
asm ("nop");
}
void main()
{
FILE *f=fopen("/dev/checksec","w");
unsigned long addr_fonction_userland = &fonction_userland;
fwrite(&addr_fonction_userland,sizeof(unsigned long),1,f);
}
When i run the program, i get this error in dmesg:
kernel tried to execute NX-protected page - exploit attempt?
I understand it is an exploit attempt, because this is what i am trying to learn. But i do not understand why the page is NX-protected. The function is not in a non execute page, it is a userland function.
SMEP and SMAP are not enabled (bits 0 in CR4 regsiter)
Thanks a lot

how to transfer string(char*) in kernel into user process using copy_to_user

I'm making code to transfer string in kernel to usermode using systemcall and copy_to_user
here is my code
kernel
#include<linux/kernel.h>
#include<linux/syscalls.h>
#include<linux/sched.h>
#include<linux/slab.h>
#include<linux/errno.h>
asmlinkage int sys_getProcTagSysCall(pid_t pid, char **tag){
printk("getProcTag system call \n\n");
struct task_struct *task= (struct task_struct*) kmalloc(sizeof(struct task_struct),GFP_KERNEL);
read_lock(&tasklist_lock);
task = find_task_by_vpid(pid);
if(task == NULL )
{
printk("corresponding pid task does not exist\n");
read_unlock(&tasklist_lock);
return -EFAULT;
}
read_unlock(&tasklist_lock);
printk("Corresponding pid task exist \n");
printk("tag is %s\n" , task->tag);
/*
task -> tag : string is stored in task->tag (ex : "abcde")
this part is well worked
*/
if(copy_to_user(*tag, task->tag, sizeof(char) * task->tag_length) !=0)
;
return 1;
}
and this is user
#include<stdio.h>
#include<stdlib.h>
int main()
{
char *ret=NULL;
int pid = 0;
printf("PID : ");
scanf("%4d", &pid);
if(syscall(339, pid, &ret)!=1) // syscall 339 is getProcTagSysCall
printf("pid %d does not exist\n", pid);
else
printf("Corresponding pid tag is %s \n",ret); //my output is %s = null
return 0;
}
actually i don't know about copy_to_user well. but I think copy_to_user(*tag, task->tag, sizeof(char) * task->tag_length) is operated like this code
so i use copy_to_user like above
#include<stdio.h>
int re();
void main(){
char *b = NULL;
if (re(&b))
printf("success");
printf("%s", b);
}
int re(char **str){
char *temp = "Gdg";
*str = temp;
return 1;
}
Is this a college assignment of some sort?
asmlinkage int sys_getProcTagSysCall(pid_t pid, char **tag){
What is this, Linux 2.6? What's up with ** instead of *?
printk("getProcTag system call \n\n");
Somewhat bad. All strings are supposed to be prefixed.
struct task_struct *task= (struct task_struct*) kmalloc(sizeof(struct task_struct),GFP_KERNEL);
What is going on here? Casting malloc makes no sense whatsoever, if you malloc you should have used sizeof(*task) instead, but you should not malloc in the first place. You want to find a task and in fact you just overwrite this pointer's value few lines later anyway.
read_lock(&tasklist_lock);
task = find_task_by_vpid(pid);
find_task_by_vpid requires RCU. The kernel would have told you that if you had debug enabled.
if(task == NULL )
{
printk("corresponding pid task does not exist\n");
read_unlock(&tasklist_lock);
return -EFAULT;
}
read_unlock(&tasklist_lock);
So... you unlock... but you did not get any kind of reference to the task.
printk("Corresponding pid task exist \n");
printk("tag is %s\n" , task->tag);
... in other words by the time you do task->tag, the task may already be gone. What requirements are there to access ->tag itself?
if(copy_to_user(*tag, task->tag, sizeof(char) * task->tag_length) !=0)
;
What's up with this? sizeof(char) is guaranteed to be 1.
I'm really confused by this entire business.
When you have a syscall which copies data to userspace where amount of data is not known prior to the call, teh syscall accepts both buffer AND its size. Then you can return appropriate error if the thingy you are trying to copy would not fit.
However, having a syscall in the first place looks incorrect. In linux per-task data is exposed to userspace in /proc/pid/. Figuring out how to add a file to proc is easy and left as an exercise for the reader.
It's quite obvious from the way you fixed it. copy_to_user() will only copy data between two memory regions - one accessible only to kernel and the other accessible also to user. It will not, however, handle any memory allocation. Userspace buffer has to be already allocated and you should pass address of this buffer to the kernel.
One more thing you can change is to change your syscall to use normal pointer to char instead of pointer to pointer which is useless.
Also note that you are leaking memory in your kernel code. You allocate memory for task_struct using kmalloc and then you override the only pointer you have to this memory when calling find_task_by_vpid() and this memory is never freed. find_task_by_vpid() will return a pointer to a task_struct which already exists in memory so there is no need to allocate any buffer for this.
i solved my problem by making malloc in user
I changed
char *b = NULL;
to
char *b = (char*)malloc(sizeof(char) * 100)
I don't know why this work properly. but as i guess copy_to_user get count of bytes as third argument so I should malloc before assigning a value
I don't know. anyone who knows why adding malloc is work properly tell me

Is a spinlock necessary in this Linux device driver code?

Is the following Linux device driver code safe, or do I need to protect access to interrupt_flag with a spinlock?
static DECLARE_WAIT_QUEUE_HEAD(wq_head);
static int interrupt_flag = 0;
static ssize_t my_write(struct file* filp, const char* __user buffer, size_t length, loff_t* offset)
{
interrupt_flag = 0;
wait_event_interruptible(wq_head, interrupt_flag != 0);
}
static irqreturn_t handler(int irq, void* dev_id)
{
interrupt_flag = 1;
wake_up_interruptible(&wq_head);
return IRQ_HANDLED;
}
Basically, I kick off some event in my_write() and wait for the interrupt to indicate that it completes.
If so, which form of spin_lock() do I need to use? I thought spin_lock_irq() was appropriate, but when I tried that I got a warning about the IRQ handler enabling interrupts.
Doesn't wait_event_interruptible evaluate the interrupt_flag != 0 condition? That would imply that the lock should be held while it reads the flag, right?
No lock is needed in the example given. Memory barriers are needed after the store of the flag, and before the load -- to ensure visibility to the flag -- but the wait_event_* and wake_up_* functions provide those. See the section entitled "Sleep and wake-up functions" in this document: https://www.kernel.org/doc/Documentation/memory-barriers.txt
Before adding a lock, consider what is being protected. Generally locks are needed if you're setting two or more separate pieces of data and you need to ensure that another cpu/core doesn't see an incomplete intermediate state (after you started but before you finished). In this case, there's no point in protecting the storing / loading of the flag value because stores and loads of a properly aligned integer are always atomic.
So, depending on what else your driver is doing, it's quite possible you do need a lock, but it isn't needed for the snippet you've provided.
Yes you need a lock. With the given example (that uses int and no specific arch is mentioned), the process context may be interrupted while accessing the interrupt_flag. Upon return from the IRQ, it may continue and interrupt_flag may be left in inconsistent state.
Try this:
static DECLARE_WAIT_QUEUE_HEAD(wq_head);
static int interrupt_flag = 0;
DEFINE_SPINLOCK(lock);
static ssize_t my_write(struct file* filp, const char* __user buffer, size_t length, loff_t* offset)
{
/* spin_lock_irq() or spin_lock_irqsave() is OK here */
spin_lock_irq(&lock);
interrupt_flag = 0;
spin_unlock_irq(&lock);
wait_event_interruptible(wq_head, interrupt_flag != 0);
}
static irqreturn_t handler(int irq, void* dev_id)
{
unsigned long flags;
spin_lock_irqsave(&lock, flags);
interrupt_flag = 1;
spin_unlock_irqrestore(&lock, flags);
wake_up_interruptible(&wq_head);
return IRQ_HANDLED;
}
IMHO, the code has to be written without making any arch or compiler-related assumptions (like the 'properly aligned integer' in Gil Hamilton answer).
Now if we can change the code and use atomic_t instead of the int flag, then no locks should be needed.

Resources