Busy inodes/dentries after umount in self-written virtual fs - linux-kernel

I wrote a simple FS that should statically contain only one file, named hello, holding the string Hello, world!. I did this for educational purposes. While the fs is mounted it behaves as expected: I can read the file just fine.
However after unmounting I always get
VFS: Busy inodes after unmount of dummyfs. Self-destruct in 5 seconds. Have a nice day...
If I ran ls on the root directory while the fs was mounted, I get
BUG: Dentry (ptrval){i=2,n=hello} still in use (-1) [unmount of dummyfs dummyfs]
on top of that.
What does this mean in detail and how can I fix it?
The mount routine calls mount_nodev, and the kill_sb routine frees the struct holding the 2 inodes this FS uses.
static struct dentry *dummyfs_mount(struct file_system_type *fs_type,
                                    int flags, const char *dev_name, void *data)
{
    struct dentry *ret;
    ret = mount_nodev(fs_type, flags, data, dummyfs_fill_super);
    if (IS_ERR(ret)) {
        printk(KERN_ERR "dummyfs_mount failed");
    }
    return ret;
}
static void dummyfs_kill_sb(struct super_block *sb) {
    kfree(sb->s_fs_info);
    kill_litter_super(sb);
}
The fill superblock method allocates that struct, creates the 2 inodes, and saves them in it:
static int dummyfs_fill_super(struct super_block *sb, void *data, int flags)
{
    struct dummyfs_info *fsi;
    sb->s_magic = DUMMYFS_MAGIC;
    sb->s_op = &dummyfs_sops;
    fsi = kzalloc(sizeof(struct dummyfs_info), GFP_KERNEL);
    sb->s_fs_info = fsi;
    fsi->root = new_inode(sb);
    fsi->root->i_ino = 1;
    fsi->root->i_sb = sb;
    fsi->root->i_op = &dummyfs_iops;
    fsi->root->i_fop = &dummyfs_dops;
    fsi->root->i_atime = fsi->root->i_mtime = fsi->root->i_ctime = current_time(fsi->root);
    inode_init_owner(fsi->root, NULL, S_IFDIR);
    fsi->file = new_inode(sb);
    fsi->file->i_ino = 2;
    fsi->file->i_sb = sb;
    fsi->file->i_op = &dummyfs_iops;
    fsi->file->i_fop = &dummyfs_fops;
    fsi->file->i_atime = fsi->file->i_mtime = fsi->file->i_ctime = current_time(fsi->file);
    inode_init_owner(fsi->file, fsi->root, S_IFREG);
    sb->s_root = d_make_root(fsi->root);
    return 0;
}
The lookup method just adds fsi->file to the dentry if the parent is the root dir:
    if (parent_inode->i_ino == fsi->root->i_ino) {
        d_add(child_dentry, fsi->file);
    }
And the iterate method just emits the dot files and the hello file when called:
    if (ctx->pos == 0) {
        dir_emit_dots(file, ctx);
        ret = 0;
    }
    if (ctx->pos == 2) {
        dir_emit(ctx, "hello", 5, file->f_inode->i_ino, DT_UNKNOWN);
        ++ctx->pos;
        ret = 0;
    }
The read method just writes a static string using copy_to_user. The offsets are calculated correctly and on EOF the method just returns 0. However since the problems occur even when the read method was not called I think it is out-of-scope for this already too long question.
To actually run this I use User-mode Linux from the git master (4.15+x commit d48fcbd864a008802a90c58a9ceddd9436d11a49). The userland is compiled from scratch and the init process is a derivative of Rich Felker's minimal init, to which I added mount calls for /proc, /sys and / (remount).
My command line is ./linux ubda=../uml/image root=/dev/ubda
Any pointers to more thorough documentation are also appreciated.

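Using gdb to watch dentry->d_lockref.count, I realized that the kill_litter_super call in umount was actually responsible for the dentry issues. Replacing it with kill_anon_super solved that problem. (kill_litter_super is meant for in-memory filesystems such as ramfs, whose dentries are pinned by an extra reference that it drops; dummyfs never takes that reference, which fits the count of -1 in the message.)
The busy inode problem mostly vanished too, except when I unmounted immediately after mounting. Allocating the second inode lazily solved that problem as well.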

Related

Linux kernel_write function returns EFBIG when appending data to big file

The task is to write a simple character device that copies all the data written to the device to a file in /tmp.
I use the kernel_write function to write data to the file, and it works fine in most cases. But when the output file is bigger than 2.1 GB, kernel_write fails with return value -27 (-EFBIG).
To write to file I use this function:
void writeToFile(void* buf, size_t size, loff_t *offset) {
    struct file* destFile;
    char* filePath = "/tmp/output";
    destFile = filp_open(filePath, O_CREAT|O_WRONLY|O_APPEND, 0666);
    if (IS_ERR(destFile) || destFile == NULL) {
        printk("Cannot open destination file");
        return;
    }
    size_t res = kernel_write(destFile, buf, size, offset);
    printk("%ld", res);
    filp_close(destFile, NULL);
}
If the size of "/tmp/output" < 2.1 GB, this function works just fine.
If the size of "/tmp/output"> 2.1 GB, kernel_write starts to return -27.
How can I fix this?
Thanks
You need to enable Large File Support (LFS) with the O_LARGEFILE flag.
The below code worked for me. Sorry, I made some other changes for debugging, but I commented above the relevant line.
struct file* writeToFile(void* buf, size_t size, loff_t *offset)
{
    struct file* destFile;
    char* filePath = "/tmp/output";
    ssize_t res; // kernel_write returns a signed byte count or a negative errno
    // Add the O_LARGEFILE flag here
    destFile = filp_open(filePath, O_CREAT | O_WRONLY | O_APPEND | O_LARGEFILE, 0666);
    if (IS_ERR(destFile))
    {
        printk("Error in opening: <%ld>", PTR_ERR(destFile));
        return destFile;
    }
    if (destFile == NULL)
    {
        printk("Error in opening: null");
        return destFile;
    }
    res = kernel_write(destFile, buf, size, offset);
    printk("CODE: <%zd>", res);
    filp_close(destFile, NULL);
    return destFile; // debugging leftover: the file is closed by now, so do not use this pointer
}
To test it, I created a file with fallocate -l 3G /tmp/output, then removed the O_CREAT flag because it was giving the kernel permission errors.
I should add a disclaimer that a lot of folks say that file I/O from the kernel is a bad idea. Even while testing this out on my own, I accidentally crashed my computer twice due to dumb errors.
Maybe do this instead: Read/Write from /proc File
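For what it's worth, the 2.1 GB threshold matches 2^31 - 1 bytes, the largest position a file opened without O_LARGEFILE may reach, and 27 is the value of EFBIG. The user-space sketch below only models that check to show where the number comes from. It is not kernel code, and model_write_check and NON_LFS_LIMIT are names made up for this illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Largest file position allowed without large-file support: 2^31 - 1. */
#define NON_LFS_LIMIT ((int64_t)INT32_MAX)

/*
 * Simplified model (an assumption, not the kernel's implementation):
 * returns how many bytes may be written at position pos, or -EFBIG when
 * the position is already at or past the limit and large_file is 0.
 */
int64_t model_write_check(int64_t pos, int64_t count, int large_file)
{
    if (large_file)
        return count;               /* no 2 GB cap with O_LARGEFILE */
    if (pos >= NON_LFS_LIMIT)
        return -EFBIG;              /* the -27 seen in the question */
    if (count > NON_LFS_LIMIT - pos)
        count = NON_LFS_LIMIT - pos; /* a write crossing the limit is cut short */
    return count;
}
```

With a 3 GB file and O_APPEND the position is already past the limit, so in this model every write fails with -EFBIG until large_file is set.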

MacOS shm - Unable to get true data size in shm

While developing with POSIX shared memory (shm) on macOS, I arrived at the procedure shown in the code below by searching (I have verified that it works).
However, there is a new problem I cannot solve: when ftruncate sets the size for shm_fd, the memory is allocated in multiples of the page size.
As a result, when the shared memory file is opened by other processes, the actual data size cannot be obtained correctly. The reported file size is an integer multiple of the page size, which causes errors when appending data.
// write data_size = 12
char *data = "....";
long data_size = 12;
shmFD = shm_open(...);
ftruncate(shmFD, data_size); // The size actually allocated is not 12 but 4096 (one page)
shmAddr = (char *)mmap(NULL, data_size, ... , shmFD, 0);
memcpy(shmAddr, data, data_size);
// read
...
fstat(shmFD, &sb);
long context_len_in_shm = sb.st_size;
// wrong shm size: context_len_in_shm = 4096
For now I use the following structure to record data in shm. Before any read or write, I first fetch the data_len field and use it to determine how much data follows. I am hoping for a more concise way, similar to using lseek() under Linux.
shm mem map :
----shm mem----
struct {
long data_len;
data[1];
data[2];
...
data[data_len];
}
---------------
long *shm_mem = (long *)shmAddr;
long data_size = shm_mem[0]; // Before reading, you need to determine whether the shm file is empty and whether the pointer is valid. It is omitted here.
char *shm_data = (char *)&(shm_mem[1]);
char *buffer = (char *)malloc(data_size);
memcpy(buffer, shm_data, data_size);
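One way to keep the length-prefix scheme above a bit tidier is to wrap it in a small struct with a flexible array member and a pair of helpers. This is only a sketch: shm_region, shm_append and shm_length are hypothetical names, base is assumed to be the address returned by mmap, and capacity the page-rounded size of the mapping. It also assumes a freshly created region starts out zero-filled, so data_len begins at 0, and it does no locking between processes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Header stored at the start of the mapped region (hypothetical layout). */
struct shm_region {
    size_t data_len;   /* number of valid payload bytes */
    char   data[];     /* payload follows the header */
};

/* Append len bytes to the region; returns 0 on success, -1 if it does not fit. */
int shm_append(void *base, size_t capacity, const void *src, size_t len)
{
    struct shm_region *r = base;
    size_t room;

    assert(base != NULL && capacity >= sizeof(*r));
    room = capacity - sizeof(*r);
    if (r->data_len > room || len > room - r->data_len)
        return -1;
    memcpy(r->data + r->data_len, src, len);
    r->data_len += len;
    return 0;
}

/* True payload size, independent of the page-rounded st_size. */
size_t shm_length(const void *base)
{
    return ((const struct shm_region *)base)->data_len;
}
```

A reader would then call shm_length on its own mapping instead of trusting fstat, and a writer would use shm_append so that the recorded length and the data stay in step.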

Copy structure with included user pointers from user space to kernel space (copy_from_user)

I want to transfer a transaction structure, which contains a user-space pointer to an array, to the kernel using copy_from_user.
The goal is to get access to the array elements in kernel space.
User space side:
I allocate an array of _sg_param structures in user space. Then I put the address of this array in a transaction structure (line (*)).
Then I transfer the transaction structure to the kernel via ioctl().
Kernel space side:
On executing this ioctl, the complete transaction structure is copied to kernel space (line (**)). Then kernel space is allocated to hold the array (line (***)). Then I try to copy the array from user space to the newly allocated kernel space (line (****)), and here my problems start:
The kernel crashes during execution of this copy. dmesg shows the following output:
[ 54.443106] Unhandled fault: page domain fault (0x01b) at 0xb6f09738
[ 54.448067] pgd = ee5ec000
[ 54.449465] [b6f09738] *pgd=2e9d7831, *pte=2d56875f, *ppte=2d568c7f
[ 54.454411] Internal error: : 1b [#1] PREEMPT SMP ARM
Any ideas ???
Here is a simplified extract of my code:
// structure declaration
typedef struct _sg_param {
    void *seg_buf;
    int seg_len;
    int received;
} sg_param_t;
struct transaction {
    ...
    int num_of_elements;
    sg_param_t *pbuf_list; // Array of sg_param structure
    ...
} trans;
// user space side:
if ((pParam = (sg_param_t *) malloc(NR_OF_STRUCTS * sizeof(sg_param_t))) == NULL) {
    return -ENOMEM;
}
else {
    trans.num_of_elements = NR_OF_STRUCTS;
    trans.pbuf_list = pParam; // (*)
}
rc = ioctl(dev->fd, MY_CMD, &trans);
if (rc < 0) {
    return rc;
}
// kernel space side
static long ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    arg_ptr = (void __user *)arg;
    // Perform the specified command
    switch (cmd) {
    case MY_CMD:
    {
        struct transaction *__user user_trans;
        user_trans = (struct transaction *__user)arg_ptr;
        if (copy_from_user(&trans, arg_ptr, sizeof(trans)) != 0) { // (**)
            k_err("Unable to copy transfer info from userspace for "
                  "AXIDMA_DMA_START_DMA.\n");
            return -EFAULT;
        }
        int size = trans.num_of_elements * sizeof(sg_param_t);
        if (trans.pbuf_list != NULL) {
            // Allocate kernel memory for buf_list
            trans.pbuf_list = (sg_param_t *) kmalloc(size, GFP_KERNEL); // (***)
            if (trans.pbuf_list == NULL) {
                k_err("Unable to allocate array for buffers.\n");
                return -ENOMEM;
            }
            // Now copy pbuf_list from user space to kernel space
            if (copy_from_user(trans.pbuf_list, user_trans->pbuf_list, size) != 0) { // (****)
                kfree(trans.pbuf_list);
                return -EFAULT;
            }
        }
        break;
    }
    }
You're directly accessing userspace data (user_trans->pbuf_list). You should use the one that you've already copied to kernel (trans.pbuf_list).
Code for this would normally be something like:
sg_param_t *local_copy = kmalloc(size, GFP_KERNEL);
if (!local_copy)
    return -ENOMEM;
// trans.pbuf_list still holds the user-space address at this point
if (copy_from_user(local_copy, (void __user *)trans.pbuf_list, size)) {
    kfree(local_copy);
    return -EFAULT;
}
trans.pbuf_list = local_copy;
// use trans.pbuf_list
Note that you also need to check that trans.num_of_elements is valid (0 would make kmalloc return ZERO_SIZE_PTR, and a too-large value could be used for a DoS).
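The kind of check meant here can be sketched in plain C so it is easy to try outside the kernel. MAX_ELEMENTS and checked_array_bytes are made-up names, and the bound of 1024 is an arbitrary assumption; in a driver the same test would run on trans.num_of_elements before the kmalloc.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ELEMENTS 1024   /* hypothetical upper bound, chosen for illustration */

/*
 * Store count * elem_size in *out and return 0, or return -EINVAL when
 * count is zero, negative, above the bound, or the product would overflow.
 */
int checked_array_bytes(int count, size_t elem_size, size_t *out)
{
    if (count <= 0 || count > MAX_ELEMENTS)
        return -EINVAL;
    if (elem_size != 0 && (size_t)count > SIZE_MAX / elem_size)
        return -EINVAL;
    *out = (size_t)count * elem_size;
    return 0;
}
```

Rejecting the value before allocating means a user-supplied count can neither produce a zero-sized allocation nor request an unreasonably large one.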

How to replace read function for procfs entry that returned EOF and byte count read both?

I am working on updating our kernel drivers to work with Linux kernel 4.4.0 on Ubuntu 16.04. The drivers last worked with Linux kernel 3.9.2.
In one of the modules, we have procfs entries created to read/write the on-board fan monitoring values. Fan monitoring is used to read/write the CPU or GPU temperature, modulation, etc. values.
The module uses the following API to create procfs entries:
struct proc_dir_entry *create_proc_entry(const char *name, umode_t mode,
                                         struct proc_dir_entry *parent);
Something like:
struct proc_dir_entry *proc_entry =
    create_proc_entry("fmon_gpu_temp", 0644, proc_dir);
proc_entry->read_proc = read_proc;
proc_entry->write_proc = write_proc;
The read_proc handler is implemented roughly like this:
static int read_value(char *buf, char **start, off_t offset, int count, int *eof, void *data) {
    int len = 0;
    int idx = (int)data;
    if(idx == TEMP_FANCTL)
        len = sprintf (buf, "%d.%02d\n", fmon_readings[idx] / TEMP_SAMPLES,
                       fmon_readings[idx] % TEMP_SAMPLES * 100 / TEMP_SAMPLES);
    else if(idx == TEMP_CPU) {
        int i;
        len = sprintf (buf, "%d", fmon_readings[idx]);
        for( i=0; i < FCTL_MAX_CPUS && fmon_cpu_temps[i]; i++ ) {
            len += sprintf (buf+len, " CPU%d=%d",i,fmon_cpu_temps[i]);
        }
        len += sprintf (buf+len, "\n");
    }
    else if(idx >= 0 && idx < READINGS_MAX)
        len = sprintf (buf, "%d\n", fmon_readings[idx]);
    *eof = 1;
    return len;
}
This read function assumes that the user has provided enough buffer space to store the temperature value; the userspace program handles this correctly. Also, every call to this function returns the value in its entirety, so there is no support or need for subsequent reads of the same temperature value.
Plus, if I use the cat program on this procfs entry from a shell, it correctly displays the value. This works, I think, because EOF is set to true and the count of bytes read is returned.
Newer Linux kernels no longer support this API.
My question is:
How can I port this to the new procfs API while keeping the same functionality: every read should return the value, and cat should also work fine and not go into an infinite loop?
The primary user interface for reading files on Linux is read(2). Its counterpart in kernel space is the .read function in struct file_operations.
Every other mechanism for reading a file in kernel space (read_proc, seq_file, etc.) is actually a (parametrized) implementation of the .read function.
The only way for the kernel to return an EOF indicator to user space is to return 0 as the number of bytes read.
Even the read_proc implementation you have for the 3.9 kernel actually implements the eof flag by returning 0 on the next invocation. And cat actually performs a second invocation of read to find out that the file has ended.
(Moreover, cat performs more than 2 invocations of read: first with 1 as count, second with count equal to page size minus 1, and the last with the remaining count.)
The simplest way to implement a "one-shot" read is to use seq_file in single_open() mode.
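To illustrate the point about EOF, here is a user-space model of a read handler that hands out a fixed value once and then reports end of file by returning 0. one_shot_read is a hypothetical name; the position handling is modelled on how a .read implementation advances its offset, and it is only a sketch, not the seq_file code itself.

```c
#include <string.h>
#include <sys/types.h>

/*
 * Copy up to count bytes of src (len bytes long) starting at *ppos into dst
 * and advance *ppos. Returns the number of bytes copied; 0 means EOF.
 */
ssize_t one_shot_read(char *dst, size_t count, off_t *ppos,
                      const char *src, size_t len)
{
    size_t avail;

    if (*ppos < 0)
        return -1;                  /* invalid position */
    if ((size_t)*ppos >= len)
        return 0;                   /* nothing left: this is the EOF signal */
    avail = len - (size_t)*ppos;
    if (count > avail)
        count = avail;
    memcpy(dst, src + *ppos, count);
    *ppos += (off_t)count;
    return (ssize_t)count;
}
```

A caller such as cat keeps calling read until it gets 0, so the first call delivers the value and the next one ends the loop. Small reads work too, because the offset remembers how much has already been handed out.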

I/O to device from kernel module fails with EFAULT

I have created a block device in a kernel module. When some I/O happens, I read/write all data from/to another existing device (let's say /dev/sdb).
It opens OK, but read/write operations return error 14 (EFAULT, Bad address). After some research I found that I need to map an address to user space (probably the buffer or filp variables), but the copy_to_user function does not help. I also looked at the mmap() and remap_pfn_range() functions, but I cannot figure out how to use them in my code, especially where to get the correct vm_area_struct structure. All the examples I found used char devices and the file_operations structure, not a block device.
Any hints? Thanks for help.
Here is my code for reading:
mm_segment_t old_fs;
old_fs = get_fs();
set_fs(KERNEL_DS);
filp = filp_open("/dev/sdb", O_RDONLY | O_DIRECT | O_SYNC, 00644);
if(IS_ERR(filp))
{
    set_fs(old_fs);
    int err = PTR_ERR(filp);
    printk(KERN_ALERT"Can not open file - %d", err);
    return;
}
else
{
    bytesRead = vfs_read(filp, buffer, nbytes, &offset); //It gives 14 error
    filp_close(filp, NULL);
}
set_fs(old_fs);
I found a better way to do I/O to a block device from a kernel module: I used the bio structure. I hope this information saves somebody a headache.
1) If you want to redirect I/O from your block device to an existing block device, you have to use your own make_request function. For that, use the blk_alloc_queue function to create a queue for your block device, like this:
device->queue = blk_alloc_queue(GFP_KERNEL);
blk_queue_make_request(device->queue, own_make_request);
Then, in the own_make_request function, change the bi_bdev member of the bio structure to the device to which you are redirecting I/O, and call the generic_make_request function:
bio->bi_bdev = device_in_which_redirect;
generic_make_request(bio);
More information is in chapter 16 of the book "Linux Device Drivers, Third Edition".
2) If you want to read or write your own data to an existing block device from a kernel module, you should use the submit_bio function.
Code for writing to a specific sector (you also need to implement the writeComplete function):
void writePage(struct block_device *device,
               sector_t sector, int size, struct page *page)
{
    struct bio *bio = bio_alloc(GFP_NOIO, 1);
    bio->bi_bdev = device; // use the device passed in as the parameter
    bio->bi_sector = sector;
    bio_add_page(bio, page, size, 0);
    bio->bi_end_io = writeComplete;
    submit_bio(WRITE_FLUSH_FUA, bio);
}
Code for reading from a specific sector (you also need to implement the readComplete function):
int readPage(struct block_device *device, sector_t sector, int size,
             struct page *page)
{
    int ret;
    struct completion event;
    struct bio *bio = bio_alloc(GFP_NOIO, 1);
    bio->bi_bdev = device;
    bio->bi_sector = sector;
    bio_add_page(bio, page, size, 0);
    init_completion(&event);
    bio->bi_private = &event;
    bio->bi_end_io = readComplete;
    submit_bio(READ | REQ_SYNC, bio);
    wait_for_completion(&event);
    ret = test_bit(BIO_UPTODATE, &bio->bi_flags);
    bio_put(bio);
    return ret;
}
A page can be allocated with alloc_page(GFP_KERNEL). To change the data in the page, use page_address(page). It returns void*, so you can interpret that pointer however you want.
