swapoff from inside kernel code - linux-kernel

kernel newbie here...
I'm trying to do a swapoff from inside kernel code (on a swap device at a known location, suitable for hardcoding). I found the syscall sys_swapoff, which looks fairly straightforward, so I tried just doing:
sys_swapoff("/path/to/swap/device");
but that doesn't work (it returns error -14, i.e. -EFAULT). Using quick-and-dirty printk debugging, I've determined that it errors out on this code block in sys_swapoff:
pathname = getname(specialfile);
err = PTR_ERR(pathname);
if (IS_ERR(pathname))
    goto out;
So apparently it doesn't like something about the pathname I'm giving it. I thought maybe it was because I was passing a string literal instead of an allocated buffer, so I tried kmalloc'ing a buffer, strcpy'ing the path into it, and passing that instead, but it made no difference. What am I doing wrong? Is there a better way to do a swapoff from inside kernel code other than using the syscall?

Are you specifying the path to the specific numbered partition (e.g. sda1 vs sda) as part of your path? Can you provide the specific value you used?
Actually, since you're trying to do this from inside kernel code: sys_swapoff expects the pathname argument to be a userspace pointer, so you probably need to decompose sys_swapoff and do some of that work yourself in your code.
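Alternatively, the classic workaround on older kernels is to temporarily widen the address limit so that getname()'s copy-from-user accepts a kernel pointer. A rough, untested sketch, assuming your code is built into the kernel (sys_swapoff isn't exported to modules) and a kernel old enough to still have set_fs()/KERNEL_DS, which no longer exist on recent kernels:

#include <linux/syscalls.h>
#include <linux/uaccess.h>

static int my_swapoff(const char *path)
{
    mm_segment_t old_fs = get_fs();
    long ret;

    /* Let getname()'s copy-from-user accept a kernel-space pointer. */
    set_fs(KERNEL_DS);
    ret = sys_swapoff((const char __user *)path);
    set_fs(old_fs);

    return (int)ret;
}

The -14 you saw is -EFAULT, which is exactly what getname() returns when the pointer it is given doesn't look like a valid userspace address.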

There's existing code in the TuxOnIce patch set (widely used for hibernation support in various Linux distros) that enables/disables swap when needed, i.e. when creating a hibernation image and/or resuming.
What this code does (check kernel/power/tuxonice_swap.c in the patch sources) is exactly what you're doing - sys_swapoff(swapfilename); - and it works. So there's nothing wrong with the invocation from kernel space as such.
How sure are you about your device pathname? Have you instrumented sys_swapon() and sys_swapoff() so that they print what is actually passed in when you issue swapon/swapoff manually on the command line? udev & friends and/or the use of an initramfs sometimes result in device pathnames that aren't universally valid.
Edit:
Since you just stated in a comment that you're using /dev/block/... - that might well be your cause; that path is an artifact of your userspace setup. Just after boot these device nodes exist directly in /dev/ (e.g. /dev/mmcblk0, which only later shows up as /dev/block/mmcblk0). Try /dev/zram0 and see what happens.

Related

Unable to change gpio value

Currently I'm trying to check the boot time of a Tixi board using systemd on a 2.6.39 Linux kernel. To do so I created a service file that calls a bash script which sets and uses a GPIO. The problem is that my system is not allowing me to change the value of the GPIO. I can successfully export it and change its direction, but NOT the value. I connected an oscilloscope to check whether the value had actually changed in hardware but simply wasn't updated in the file, as suggested in some forums, but it was the same: the value just doesn't change!
I should also point out that the same script works if I use System V init, with exactly the same configuration for the kernel, busybox and filesystem.
It is quite ironic because I'm already root on the system; nevertheless, even changing the permissions of the file does not allow me to change its value. There is also no feedback from the kernel saying that the operation was not possible; it looks as if it succeeded, but when I check the value it is the same as before.
I also tried this on Raspbian with a 3.12 kernel (which I switched to systemd) and there it was in fact possible, in the normal way from userspace.
I would appreciate any idea of what the problem might be, since I have already run out of ideas.
Thanks
PS: This is the code that should work on the bash line:
echo 0 > /sys/class/gpio/gpio104/value
more /sys/class/gpio/gpio104/value
# I get 1, not 0 as I requested
Nevertheless, the same lines work on the same board if I use System V init, but not if I use systemd.
Probably caused by the lack of udev in your new setup, which changes the permissions for those GPIOs in /sys/class. You might want to just put udev back and see if it fixes your problem.
I don't know your image configuration, but each GPIO pin needs to be exported prior to use. Are you doing that, or is it done automatically? If you have the OMAP mux debugfs interface enabled in the kernel, you do something like:
echo 0x104 > /sys/kernel/debug/omap_mux/cam_d5 (set mode 4, as stipulated in the TI Sitara TRM)
echo 104 > /sys/class/gpio/export (export the pin)
echo out > /sys/class/gpio/gpio104/direction (set the pin as output)
Also run dmesg | grep gpio and see if there is any initialization problem with the GPIO mux.
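If the shell gives you no error at all, it can also help to do the write from a tiny C program and check the return value of write() and errno explicitly, so a silently refused write (EACCES, EINVAL, ...) becomes visible. A rough sketch, with gpio104 taken from your example:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/class/gpio/gpio104/value";
    int fd = open(path, O_WRONLY);

    if (fd < 0) {
        fprintf(stderr, "open %s: %s\n", path, strerror(errno));
        return 1;
    }
    if (write(fd, "0", 1) != 1)
        fprintf(stderr, "write: %s\n", strerror(errno));
    else
        printf("write reported success\n");

    close(fd);
    return 0;
}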
Actually I've faced an issue similar to yours, i.e. I was not able to change the value of a set of GPIO pins manually.
It turned out that even though the pin is named gpio, it can only be used as an input (DM3730 gpio_114 and gpio_115).
So please refer to the datasheet and confirm that the pin can be used for both input and output.

How to set intel_idle.max_cstate=0 to disable c-states?

I would like to disable C-states on my computer.
I disabled C-states in the BIOS but it had no effect. However, I found an explanation:
"Most newer Linux distributions, on systems with Intel processors, use the “intel_idle” driver (probably compiled into your kernel and not a separate module) to use C-states. This driver uses knowledge of the various CPUs to control C-states without input from system firmware (BIOS). This driver will mostly ignore any other BIOS setting and kernel parameters"
I found two possible solutions, but I don't know how to apply them:
1) "...so if you want control over C-states, you should use the kernel parameter “intel_idle.max_cstate=0” to disable this driver."
I don't know how to check the value of intel_idle.max_cstate, nor how to change it.
2) "To dynamically control C-states, open the file /dev/cpu_dma_latency and write the maximum allowable latency to it. This will prevent C-states with transition latencies higher than the specified value from being used, as long as the file /dev/cpu_dma_latency is kept open. Writing a maximum allowable latency of 0 will keep the processors in C0."
I can't read the file /dev/cpu_dma_latency.
Thanks for your help.
Computer:
Intel Xeon CPU E5-2620
Gnome 2.28.2
Linux 2.6.32-358
To alter the value at boot time, you can modify the GRUB configuration or edit it on the fly; the exact method varies by distribution. This is the Ubuntu documentation for changing kernel parameters either for a single boot or permanently. For a RHEL-derived distribution I don't see docs that are quite as clear, but you edit /boot/grub/grub.conf directly and add the parameter to the "kernel" line of each bootable stanza. You can check what is currently in effect via /proc/cmdline (the boot parameters) and, if the driver is present, /sys/module/intel_idle/parameters/max_cstate.
For the second part of the question: many device files are read-only or write-only. You could use a small Perl script like this (untested and not very clean, but it should work) to keep the file open:
#!/usr/bin/perl
use FileHandle;

# Open the file and keep it open; the request is dropped once it is closed.
open(my $fd, '>', '/dev/cpu_dma_latency') or die "open: $!";
$fd->autoflush(1);
# Older kernels expect a raw 32-bit value rather than an ASCII string.
print $fd pack('l', 0);
print "Press CTRL-C to end.\n";
while (1) {
    sleep 5;
}
Red Hat has a C snippet in a KB article here as well, along with more description of the parameter.
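In case that article isn't reachable, the idea in C looks roughly like this; an untested sketch that writes a raw 32-bit latency of 0 and holds the file descriptor open until interrupted:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t latency_us = 0;   /* 0 keeps the processors in C0 */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);

    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }
    if (write(fd, &latency_us, sizeof(latency_us)) != sizeof(latency_us)) {
        perror("write");
        return 1;
    }
    printf("Holding the C-state request; press CTRL-C to release it.\n");
    pause();   /* the request is dropped as soon as the fd is closed */
    close(fd);
    return 0;
}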

what is the purpose of the BeingDebugged flag in the PEB structure?

What is the purpose of this flag (from the OS side)?
Which functions use this flag besides IsDebuggerPresent?
Thanks a lot
It's effectively the same, but reading the PEB doesn't require a trip through kernel mode.
More explicitly, the IsDebuggerPresent API is documented and stable; the PEB structure is not, and could, conceivably, change across versions.
Also, the IsDebuggerPresent API (or flag) only checks for user-mode debuggers; kernel debuggers aren't detected via this function.
Why put it in the PEB? It saves some time, which was more important in early versions of NT. (There are a bunch of user-mode functions that check this flag before doing some runtime validation, and will break to the debugger if set.)
If you change the PEB field to 0, then IsDebuggerPresent will also return 0, although I believe that CheckRemoteDebuggerPresent will still report the debugger.
As you have found, IsDebuggerPresent reads this flag from the PEB. As far as I know the PEB structure is not an official API, but IsDebuggerPresent is, so you should stick to that layer.
The uses of this method are quite limited if you are after copy protection to prevent debugging of your app. As you have found, it is only a flag in your process space; if somebody debugs your application, all they need to do is zero out the flag in the PEB and let your app run.
You can raise the bar by using CheckRemoteDebuggerPresent, passing in your own process handle to get an answer. This call goes into the kernel and checks for the existence of a special debug structure that is associated with your process if it is being debugged. A user-mode process cannot fake that one, but of course there are always ways around it, e.g. by simply patching out your check.
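For reference, a minimal user-mode sketch of the two documented checks discussed above (nothing here is undocumented, just IsDebuggerPresent and CheckRemoteDebuggerPresent):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    BOOL remote = FALSE;

    /* Reads PEB->BeingDebugged under the hood; no kernel transition. */
    printf("IsDebuggerPresent: %d\n", IsDebuggerPresent());

    /* Asks the kernel whether a debug port is attached to this process. */
    if (CheckRemoteDebuggerPresent(GetCurrentProcess(), &remote))
        printf("CheckRemoteDebuggerPresent: %d\n", remote);

    return 0;
}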

Change user space memory protection flags from kernel module

I am writing a kernel module that has access to a particular process's memory. I have done an anonymous mapping on some of the user space memory with do_mmap():
#define MAP_FLAGS (MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS)
prot = PROT_WRITE;
retval = do_mmap(NULL, vaddr, vsize, prot, MAP_FLAGS, 0);
vaddr and vsize are set earlier, and the call succeeds. After I write to that memory block from the kernel module (via copy_to_user), I want to remove the PROT_WRITE permission on it (like I would with mprotect in normal user space). I can't seem to find a function that will allow this.
I attempted unmapping the region and remapping it with the correct protections, but that zeroes out the memory block, erasing all the data I just wrote; setting MAP_UNINITIALIZED might fix that, but, from the man pages:
MAP_UNINITIALIZED (since Linux 2.6.33)
Don't clear anonymous pages. This flag is intended to improve performance on embedded
devices. This flag is only honored if the kernel was configured with the
CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option
is normally enabled only on embedded devices (i.e., devices where one has complete
control of the contents of user memory).
so, while that might do what I want, it wouldn't be very portable. Is there a standard way to accomplish what I've suggested?
After some more research, I found a function called get_user_pages() (the best documentation I've found is here) that returns a list of pages from userspace at a given address; these can be mapped into kernel space with kmap() and written to that way (in my case, using kernel_read()). This can be used as a replacement for copy_to_user(), because it allows forcing write permission on the pages retrieved. The only drawback is that you have to write page by page instead of all in one go, but it does solve the problem I described in my question.
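For reference, a rough and untested sketch of that approach, using the older get_user_pages() signature that takes a task/mm pair (newer kernels use get_user_pages_remote() with FOLL_WRITE/FOLL_FORCE flags instead); the function name and the single-page limit are just for illustration:

#include <linux/errno.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/string.h>

/* Write len bytes into one page of a user mapping, forcing write access. */
static int write_user_page(struct task_struct *tsk, unsigned long vaddr,
                           const void *buf, size_t len)
{
    struct page *page;
    void *kaddr;
    int ret;

    if (len > PAGE_SIZE - (vaddr & ~PAGE_MASK))
        return -EINVAL;            /* keep the sketch to a single page */

    down_read(&tsk->mm->mmap_sem);
    /* nr_pages=1, write=1, force=1: write even to a read-only mapping */
    ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &page, NULL);
    up_read(&tsk->mm->mmap_sem);
    if (ret != 1)
        return ret < 0 ? ret : -EFAULT;

    kaddr = kmap(page);
    memcpy(kaddr + (vaddr & ~PAGE_MASK), buf, len);
    kunmap(page);

    set_page_dirty_lock(page);
    put_page(page);
    return 0;
}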
In userspace there is a system call, mprotect, that can modify the protection flags on an existing mapping. You probably need to follow the implementation of that system call, or maybe simply call it directly from your code. See mm/mprotect.c.

Problem with .release behavior in file_operations

I'm dealing with a problem in a kernel module that gets data from userspace through a /proc entry.
I set up open/write/release handlers for my own /proc entry, and I can use it to get data from userspace without trouble.
I handle errors in the open/write functions properly, and they are visible to the user as open/fopen or write/fwrite/fprintf errors.
But some errors can only be checked at close time (because only then is all the data available). In those cases I return something other than 0, which I assumed would somehow become the value that close or fclose returns to the user.
But whatever value I return, close behaves as if everything were fine.
To be sure, I replaced the whole release() body with a simple 'return -1;' and wrote a program that opens/writes/closes the /proc entry and prints the return value of close (and errno). It always returns 0, whatever value I give.
The behavior is the same with fclose, or when using shell redirection (echo "..." > /proc/my/entry).
Any clue about this strange behavior, which is not what many tutorials I found claim?
BTW I'm using a RHEL5 kernel (2.6.18, Red Hat modified) on a 64-bit system.
Thanks.
Regards,
Yannick
The release() isn't allowed to cause close() to fail: the VFS ignores its return value, so it never reaches userspace.
You could require your userspace programs to call fsync() on the file descriptor before close() if they want to find out about all possible errors, and then implement your final error checking in the fsync() handler.
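A minimal sketch of that suggestion, using the 2.6.18-era .fsync prototype (since 3.1 it takes loff_t start/end instead of a dentry); my_proc_ctx and my_validate() are hypothetical stand-ins for whatever per-open state and final check your module already has:

#include <linux/errno.h>
#include <linux/fs.h>

/* Hypothetical per-open state, stored in file->private_data by the open handler. */
struct my_proc_ctx {
    size_t bytes_received;
};

/* Hypothetical final check that only makes sense once all data has arrived. */
static int my_validate(struct my_proc_ctx *ctx)
{
    return ctx->bytes_received ? 0 : -EINVAL;
}

/* Hook this up as .fsync in the existing file_operations, next to the
 * open/write/release handlers. */
static int my_proc_fsync(struct file *file, struct dentry *dentry, int datasync)
{
    struct my_proc_ctx *ctx = file->private_data;

    /* Unlike the .release return value, this one does reach userspace,
     * as the result of fsync(fd). */
    return my_validate(ctx);
}

Userspace then calls fsync(fd) (or fflush() followed by fsync(fileno(fp)) for stdio) before close() and checks that return value.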
