When we execute the dd command, which write function gets called?
As per my understanding, the dd command is not filesystem specific, so no filesystem's file_operations is involved. Please correct me if I am wrong here.
I would like to know which file_operations is involved in carrying out a dd operation.
That depends on what you write to.
Either it is a regular file, in which case filesystem-specific calls are used, or it is a device, in which case you eventually go through the driver underlying the target disk (or whatever the device is).
http://www.makelinux.net/books/ulk3/understandlk-CHP-14-SECT-5#understandlk-CHP-14-SECT-5
The write system call does indeed end up invoking the file system specific write via the VFS layer. See the vfs_write function.
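For reference, here is a heavily condensed sketch of that dispatch, based on the 2.6-era fs/read_write.c (security hooks and most error handling omitted). Whatever file_operations the opened file carries is what gets called:
/* condensed sketch of vfs_write() from fs/read_write.c (2.6-era);
 * security hooks and most error handling omitted */
ssize_t vfs_write(struct file *file, const char __user *buf,
                  size_t count, loff_t *pos)
{
        if (!(file->f_mode & FMODE_WRITE))
                return -EBADF;

        /* dispatch to the file_operations attached at open() time:
         * e.g. ext3_file_operations for a regular file on ext3,
         * def_blk_fops for a block device node such as /dev/sda */
        if (file->f_op->write)
                return file->f_op->write(file, buf, count, pos);
        return do_sync_write(file, buf, count, pos);   /* aio_write path */
}
So with dd of=/dev/sdb you end up in the block-device file_operations (and from there in the block layer and the disk driver), while dd of=somefile goes through the filesystem's own write path.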
I have a requirement where many threads will call the same shell script to perform some work and will then write output (a single line of text) to a common text file.
Since many threads will try to write data to the same file, my question is whether Unix provides a default locking mechanism so that they cannot all write at the same time.
Performing a short single write to a file opened for append is mostly atomic; you can get away with it most of the time (depending on your filesystem), but if you want to be guaranteed that your writes won't interrupt each other, or to write arbitrarily long strings, or to be able to perform multiple writes, or to perform a block of writes and be assured that their contents will be next to each other in the resulting file, then you'll want to lock.
While not part of POSIX (unlike the C library call for which it's named), the flock tool provides the ability to perform advisory locking ("advisory" -- as opposed to "mandatory" -- meaning that other potential writers need to voluntarily participate):
(
  flock -x 99 || exit   # lock the file descriptor
  echo "content" >&99   # write content to that locked FD
) 99>>/path/to/shared-file
The use of file descriptor #99 is completely arbitrary -- any unused FD number can be chosen. Similarly, one can safely put the lock on a different file than the one to which content is written while the lock is held.
The advantage of this approach over several conventional mechanisms (such as using exclusive creation of a file or directory) is automatic unlock: If the subshell holding the file descriptor on which the lock is held exits for any reason, including a power failure or unexpected reboot, the lock will be automatically released.
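If the writers are C programs rather than shell scripts, the same pattern is a few lines around the flock() call that the tool wraps. A minimal sketch (the path and function name are just for illustration):
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* append one line to a shared file under an exclusive advisory lock */
int append_line(const char *path, const char *line)
{
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    if (flock(fd, LOCK_EX) == 0) {   /* blocks until we own the lock */
        write(fd, line, strlen(line));
        flock(fd, LOCK_UN);          /* optional: close() drops it too */
    }
    close(fd);                       /* lock vanishes with the fd */
    return 0;
}
As with the shell version, the lock disappears automatically when the descriptor is closed, however the process exits.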
my question is whether unix provides a default locking mechanism so
that all can not write at the same time.
In general, no. At least not something that's guaranteed to work. But there are other ways to solve your problem, such as lockfile, if you have it available:
Examples
Suppose you want to make sure that access to the file "important" is
serialised, i.e., no more than one program or shell script should be
allowed to access it. For simplicity's sake, let's suppose that it is
a shell script. In this case you could solve it like this:
...
lockfile important.lock
...
access_"important"_to_your_hearts_content
...
rm -f important.lock
...
Now if all the scripts that access "important" follow this guideline,
you will be assured that at most one script will be executing between
the 'lockfile' and the 'rm' commands.
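The exclusive-creation idea behind lockfile can also be written out by hand in C with O_CREAT | O_EXCL; a rough sketch (the lock path and the retry policy are made up for the example):
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* create important.lock exclusively; retry while someone else holds it */
static int take_lock(const char *lockpath)
{
    for (;;) {
        int fd = open(lockpath, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd >= 0)
            return fd;          /* we own the lock file now */
        if (errno != EEXIST)
            return -1;          /* unexpected error */
        sleep(1);               /* lock held by someone else; try again */
    }
}

static void drop_lock(const char *lockpath, int fd)
{
    close(fd);
    unlink(lockpath);           /* the equivalent of 'rm -f important.lock' */
}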
But there's actually a better way, if you can use C or C++: use the low-level open call to open the file in append mode, and call write() to write your data, with no locking necessary. Per the write() man page:
If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Like this:
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
// process-wide global file descriptor
int outputFD = -1;
.
.
.
// during startup, open the file in append mode
outputFD = open( fileName, O_WRONLY | O_APPEND, 0600 );
.
.
.
// write a string to the file
ssize_t writeToFile( const char *data )
{
    return( write( outputFD, data, strlen( data ) ) );
}
In practice, you can write anything to the file - it doesn't have to be a NUL-terminated character string.
That's supposed to be atomic on writes up to PIPE_BUF bytes, which is usually something like 512, 4096, or 5120. Some Linux filesystems apparently don't implement that properly, so you may in practice be limited to about 1K on those file systems.
As I understand it, the kernel mainly provides two interfaces for user space to do something in the kernel: system calls and virtual file systems (procfs, sysfs, etc.).
What I read in a book is that internally the VFS also uses system calls.
So I want to know: how exactly are these two connected? And in which situations should we use the VFS over a system call, and vice versa?
A system call is the generic facility for any user space process to switch from user space mode to kernel mode.
It is like a function call that resides in the kernel and is invoked from user space with a variable number of parameters, the most important one being the syscall number.
The kernel will always maintain an architecture-specific array of supported system calls (=kernel functions) and will basically dispatch any syscall coming from user space to the correct function based on the system call number passed from user space.
The Virtual File System is just an abstraction of a file system that provides you with standard functions to deal with anything that can be considered a file. So, for example, you can call "open", "close", "read", etc. on any file without being concerned about which filesystem the file is stored in.
The relation between the VFS and syscalls is that the VFS is basically code that resides in the kernel, and the only way to get to the kernel is through syscalls ("open" is a syscall, so is "close", etc.).
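To make the "dispatch by syscall number" point concrete, here is a tiny user-space program that bypasses the libc wrappers and issues the calls by number; the kernel looks each number up in its syscall table and ends up in the same VFS code it would reach through open()/read()/close(). A sketch for x86_64; on some newer architectures SYS_open does not exist and SYS_openat would be used instead.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* same effect as open("/etc/hostname", O_RDONLY) */
    long fd = syscall(SYS_open, "/etc/hostname", O_RDONLY);
    if (fd >= 0) {
        char buf[128];
        long n = syscall(SYS_read, fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("%s", buf);
        }
        syscall(SYS_close, fd);
    }
    return 0;
}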
Recently I was looking through the kernel at kobjects and sysfs.
I know/understand the following:
All kernel objects use addresses > 0x80000000
kobjects should be no exception to this rule
The sysfs is nothing but a hierarchy of kobjects (maybe includes ksets and other k* stuff..not sure)
Given this information, I'm not sure I understand exactly what happens when I run echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
I can see that the cpufreq module has a function called store_scaling_governor which handles writes to this 'file'... but how does user mode transition into kernel mode with this simple echo?
When you execute the command echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor, your shell calls the write system call, and the kernel dispatches it to the corresponding handler.
cpufreq sets up struct kobj_type ktype_cpufreq with its sysfs_ops, and registers it in cpufreq_add_dev_interface(). After that, the kernel can find the corresponding handler to execute on a write syscall.
I can tell you about one implementation which I have used for accessing kernel-space variables from sysfs (user space, at a shell prompt). Basically, each variable exposed to user space through the sys file system appears as a separate file under /sys/. When you issue echo value > /sys/file-path at a shell prompt (user space), the method that gets called in kernel space is the .store method. Likewise, when you issue cat /sys/file-path, the method that gets called in the kernel is .show. You can find more information here: http://lwn.net/Articles/31220/
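To see where .show and .store live, here is a minimal sketch of such a module (all names here are invented for the example; the attribute appears as /sys/kernel/demo/value):
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/string.h>
#include <linux/sysfs.h>

static char value[32] = "default";

/* runs in kernel mode when user space does: cat /sys/kernel/demo/value */
static ssize_t value_show(struct kobject *kobj,
                          struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%s\n", value);
}

/* runs in kernel mode when user space does: echo foo > /sys/kernel/demo/value */
static ssize_t value_store(struct kobject *kobj,
                           struct kobj_attribute *attr,
                           const char *buf, size_t count)
{
        strncpy(value, buf, sizeof(value) - 1);
        return count;
}

static struct kobj_attribute value_attr =
        __ATTR(value, 0644, value_show, value_store);
static struct kobject *demo_kobj;

static int __init demo_init(void)
{
        demo_kobj = kobject_create_and_add("demo", kernel_kobj);
        if (!demo_kobj)
                return -ENOMEM;
        return sysfs_create_file(demo_kobj, &value_attr.attr);
}

static void __exit demo_exit(void)
{
        kobject_put(demo_kobj);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
The echo itself is still just a write() syscall; sysfs's file_operations turn that write into a call to the .store registered for the attribute, which is essentially what cpufreq does with store_scaling_governor.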
kernel newbie here...
I'm trying to do a swapoff from inside kernel code (on a swap device at a known location, suitable for hardcoding). I found the syscall sys_swapoff, which looks fairly straightforward, so I tried just doing:
sys_swapoff("/path/to/swap/device");
but that doesn't work (it returns error number -14). Using ghetto-style debugging via printk, I've determined that it's erroring out on this code block in sys_swapoff:
pathname = getname(specialfile);
err = PTR_ERR(pathname);
if (IS_ERR(pathname))
        goto out;
So apparently it doesn't like something about the pathname I'm giving it. I thought maybe it was because I was passing it a string literal instead of an allocated buffer, so I tried kmallocing a buffer, strcpying the path into it, and passing it that, but that made no difference. What am I doing wrong? Is there a better way to do a swapoff from inside kernel code other than using the syscall?
Are you specifying the path to the specific numbered partition (e.g. sda1 vs sda) as part of your path? Can you provide the specific value you used?
Actually, if you're trying to do this inside kernel code -- sys_swapoff expects the parameter being passed in to be from userspace, so you probably need to decompose sys_swapoff and do some of that work yourself in your code.
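The -14 is -EFAULT: getname() does a copy-from-user-style fetch of the pathname, and your string lives in kernel space, so the check fails no matter how you allocate the buffer. On 2.6-era kernels the usual workaround (assuming sys_swapoff is actually reachable from your code, e.g. it is built in rather than a module) is to temporarily lift the user-space address check, roughly like this; note that set_fs() has been removed from modern kernels:
#include <linux/syscalls.h>
#include <asm/uaccess.h>

/* sketch for a 2.6-era kernel: let sys_swapoff() accept a kernel pointer */
static long my_swapoff(const char *path)
{
        mm_segment_t old_fs = get_fs();
        long err;

        set_fs(KERNEL_DS);      /* "user" accesses may now hit kernel addresses */
        err = sys_swapoff((const char __user *)path);
        set_fs(old_fs);         /* restore the original limit */
        return err;
}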
There's existing code in the TuxOnIce patch set (widely used for hibernation support in various Linux distros) that enables / disables swap when needed, i.e. when creating a hibernation image and/or resuming.
What this code does (check kernel/power/tuxonice_swap.c in the patch sources) is exactly the same as you do - sys_swapoff(swapfilename); - and it's functional. So there's nothing wrong with the invocation from kernel space as such.
How sure are you about your device pathname? Have you instrumented sys_swapon() and sys_swapoff() so that they print what is actually passed when you manually issue swapon / swapoff commands on the command line? udev & friends and/or the use of an initramfs sometimes result in device pathnames that aren't universally valid.
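Instrumenting is just a printk right after the pathname has been resolved, e.g. (a sketch against the 2.6-era mm/swapfile.c snippet you quoted):
pathname = getname(specialfile);
if (!IS_ERR(pathname))
        printk(KERN_INFO "sys_swapoff: pathname='%s'\n", pathname);
Then run swapoff from the command line and compare what is printed with the path your kernel code passes.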
Edit:
Since you just stated in a comment that you're attempting /dev/block/... - that might well be the cause; that path is an artifact, and just after boot these device nodes exist directly in /dev/ (e.g. /dev/mmcblk0 only becomes /dev/block/mmcblk0 later). Try /dev/zram0 and see what happens.
I'm dealing with a problem in a kernel module that gets data from user space using a /proc entry.
I set open/write/release entries for my own /proc entry, and I can use it to get data from user space without problems.
I handle errors in the open/write functions, and they are visible to the user as open/fopen or write/fwrite/fprintf errors.
But some of the errors can only be checked at close time (because that's when all the data is available). In these cases I return something different from 0, which I assumed would in some way be the value that 'close' or 'fclose' returns to the user.
But whatever value I return, my close behaves as if everything is fine.
To be sure, I replaced all the release() code with a simple 'return(-1);' and wrote a program that opens/writes/closes the /proc entry and prints the close return value (and errno). It always returns '0' whatever value I give.
The behavior is the same with 'fclose', or when using shell mechanisms (echo "..." > /proc/my/entry).
Any clue about this strange behavior, which is not what is claimed in many tutorials I found?
BTW I'm using a RHEL5 kernel (2.6.18, Red Hat modified), on a 64-bit system.
Thanks.
Regards,
Yannick
The release() isn't allowed to cause the close() to fail.
You could require your userspace programs to call fsync() on the file descriptor before close(), if they want to find out about all possible errors; then implement your final error checking in the fsync() handler.
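A rough sketch of what that looks like on a 2.6.18-era kernel (the fsync prototype changed in later kernels); the myproc_* names and my_validate_buffer() are hypothetical stand-ins for your existing handlers and your deferred check:
#include <linux/fs.h>
#include <linux/module.h>

extern int my_validate_buffer(void *data);   /* hypothetical deferred check */

/* deferred validation: unlike release(), an error returned here really is
   propagated to user space as the return value of fsync(fd) */
static int myproc_fsync(struct file *file, struct dentry *dentry, int datasync)
{
        if (!my_validate_buffer(file->private_data))
                return -EINVAL;
        return 0;
}

static const struct file_operations myproc_fops = {
        .owner   = THIS_MODULE,
        .open    = myproc_open,      /* your existing handlers */
        .write   = myproc_write,
        .release = myproc_release,   /* return value ignored by close() */
        .fsync   = myproc_fsync,
};
User space then does write(fd, ...); if (fsync(fd) == -1) { /* inspect errno */ } close(fd); and sees the error that release() could never deliver.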