pid_task() causes kernel panic, linux kernel 5.4 - linux-kernel

I am trying to send a signal from kernel space to user space.
I have the code below and I am seeing a kernel panic:
[ 5230.132362] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: prog_irq_handler+0x1d4/0x2cc [prog_mon]
[ 5230.146795] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: prog_irq_handler+0x1d4/0x2cc
Upon debugging some more, I found that the source of the panic is the call below:
t = pid_task(find_pid_ns(id, &init_pid_ns), PIDTYPE_PID);
Referencing the value of "t" seems to cause an exception, resulting in a kernel panic.
Is there any known issue in kernel 5.4 with respect to pid_task()?
Any help will be appreciated.
My kernel comes from yocto, and it is:
branch ti-linux-5.4.y
commit 6f3bf13d53820fc12432d7052744be2ee046fc92 (HEAD -> ti-linux-5.4.y)
Merge: d2f658ed506d d5ef1ab82339
Author: LCPD Auto Merger <lcpd_integration@list.ti.com>
Date: Fri Apr 3 10:50:48 2020 -0500
https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/
Full code below:
static int send_signal(int val, int id, int sig)
{
    struct kernel_siginfo info;
    struct task_struct *t;
    int ret;

    ret = 0;
    if ((id > 0) && (sig > 0)) {
        memset(&info, 0, sizeof(struct siginfo));
        info.si_signo = sig;
        /* Using SI_KERNEL here results in real_time data not getting delivered to the user space signal handler */
        info.si_code = SI_QUEUE;
        /* Real time signals may have 32 bits of data */
        info.si_int = val;
        info._sifields._rt._sigval.sival_int = val;
        info.si_errno = 0;

        rcu_read_lock();
        t = pid_task(find_pid_ns(id, &init_pid_ns), PIDTYPE_PID);
        if (t == NULL) {
            printk(KERN_ERR "%s: Invalid user handler PID %d\n", module_name, id);
            rcu_read_unlock();
            return -ENODEV;
        }
        ret = send_sig_info(sig, &info, t);
        rcu_read_unlock();
        if (ret < 0)
            printk(KERN_INFO "%s: Failed to signal with data %d to user space\n", module_name, val);
    }
    return ret;
}
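For reference, below is a minimal sketch of the same lookup done with explicit reference counting. It is only an illustrative variant of the code above (it assumes the same info, sig and id variables, and that id is a PID in the current PID namespace), not a diagnosis of the panic:

    /* Sketch only: refcounted lookup, so the task cannot go away between
     * the lookup and send_sig_info(). */
    struct pid *pid;
    struct task_struct *t;
    int ret = -ENODEV;

    pid = find_get_pid(id);                  /* takes a reference on the struct pid */
    if (pid) {
        t = get_pid_task(pid, PIDTYPE_PID);  /* takes a reference on the task */
        put_pid(pid);
        if (t) {
            ret = send_sig_info(sig, &info, t);
            put_task_struct(t);              /* drop the task reference */
        }
    }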

Related

How to chunk shell script input by time, not by size?

In a bash script I am using a many-producer single-consumer pattern. Producers are background processes writing lines into a fifo (via GNU Parallel). The consumer reads all lines from the fifo, then sorts, filters, and prints the formatted result to stdout.
However, it could take a long time until the full result is available. Producers are usually fast on the first few results but then slow down. Here I am more interested in seeing chunks of data every few seconds, each sorted and filtered individually.
mkfifo fifo
parallel ... >"$fifo" &
while chunk=$(read with timeout 5s and at most 10s <"$fifo"); do
process "$chunk"
done
The loop would run until all producers are done and all input is read. Each chunk is read until there has been no new data for 5s, or until 10s have passed since the chunk was started. A chunk may also be empty if there was no new data for 10s.
I tried to make it work like this:
output=$(mktemp)
while true; do
    wasTimeout=0 interruptAt=$(( $(date '+%s') + 10 ))
    while true; do
        IFS= read -r -t5 <>"${fifo}"
        rc="$?"
        if [[ "${rc}" -gt 0 ]]; then
            [[ "${rc}" -gt 128 ]] && wasTimeout=1
            break
        fi
        echo "$REPLY" >>"${output}"
        if [[ $(date '+%s') -ge "${interruptAt}" ]]; then
            wasTimeout=1
            break
        fi
    done
    echo '---' >>"${output}"
    [[ "${wasTimeout}" -eq 0 ]] && break
done
I tried some variations of this. In the form above it reads the first chunk but then loops forever. If I use <"${fifo}" (not the read/write redirection above) it blocks after the first chunk. Maybe all of this could be simplified with buffer and/or stdbuf? But both of them define blocks by size, not by time.
This is not a trivial problem to resolve. As I hinted, a C program (or a program in some programming language other than the shell) is probably the best solution. Some of the complicating factors are:
Reading with timeouts.
If data arrives soon enough, the timeout changes.
Different systems have different sets of interval timing functions:
alarm() is likely available everywhere, but has only 1-second resolution which is liable to accumulated rounding errors. (Compile this version with make UFLAGS=-DUSE_ALARM; on macOS, use make UFLAGS=-DUSE_ALARM LDLIB2=.)
setitimer()
uses microsecond timing and the struct timeval type. (Compile this version with make UFLAGS=-DUSE_SETITIMER; on macOS, compile with make UFLAGS=-DUSE_SETITIMER LDLIB2=.)
timer_create() and
timer_settime() etc use the modern nanosecond type struct timespec. This is available on Linux; it is not available on macOS 10.14.5 Mojave or earlier. (Compile this version with make; it won't work on macOS.)
The program usage message is:
$ chunker79 -h
Usage: chunker79 [-hvV][-c chunk][-d delay][-f file]
-c chunk Maximum time to wait for data in a chunk (default 10)
-d delay Maximum delay after line read (default: 5)
-f file Read from file instead of standard input
-h Print this help message and exit
-v Verbose mode: print timing information to stderr
-V Print version information and exit
$
This code is available in my SOQ (Stack Overflow Questions) repository on GitHub as file chunker79.c in the src/so-5631-4784 sub-directory. You will need some of the support code from the src/libsoq directory too.
/*
#(#)File: chunker79.c
#(#)Purpose: Chunk Reader for SO 5631-4784
#(#)Author: J Leffler
#(#)Copyright: (C) JLSS 2019
*/
/*TABSTOP=4*/
/*
** Problem specification from the Stack Overflow question
**
** In a bash script I am using a many-producer single-consumer pattern.
** Producers are background processes writing lines into a fifo (via GNU
** Parallel). The consumer reads all lines from the fifo, then sorts,
** filters, and prints the formatted result to stdout.
**
** However, it could take a long time until the full result is
** available. Producers are usually fast on the first few results but
** then would slow down. Here I am more interested to see chunks of
** data every few seconds, each sorted and filtered individually.
**
** mkfifo fifo
** parallel ... >"$fifo" &
** while chunk=$(read with timeout 5s and at most 10s <"$fifo"); do
** process "$chunk"
** done
**
** The loop would run until all producers are done and all input is
** read. Each chunk is read until there has been no new data for 5s, or
** until 10s have passed since the chunk was started. A chunk may also
** be empty if there was no new data for 10s.
*/
/*
** Analysis
**
** 1. If no data arrives at all for 10 seconds, then the program should
** terminate producing no output. This timeout is controlled by the
** value of time_chunk in the code.
** 2. If data arrives more or less consistently, then the collection
** should continue for 10s and then finish. This timeout is also
** controlled by the value of time_chunk in the code.
** 3. If a line of data arrives before 5 seconds have elapsed, and no
** more arrives for 5 seconds, then the collection should finish.
** (If the first line arrives after 5 seconds and no more arrives
** for more than 5 seconds, then the 10 second timeout cuts in.)
** This timeout is controlled by the value of time_delay in the code.
** 4. This means that we want two separate timers at work:
** - Chunk timer (started when the program starts).
** - Delay timer (started each time a line is read).
**
** It doesn't matter which timer goes off, but further timer signals
** should be ignored. External signals will confuse things; tough!
**
** -- Using alarm(2) is tricky because it provides only one time, not two.
** -- Using getitimer(2), setitimer(2) uses obsolescent POSIX functions,
** but these are available on macOS.
** -- Using timer_create(2), timer_destroy(2), timer_settime(2),
** timer_gettime(2) uses current POSIX function but is not available
** on macOS.
*/
#include "posixver.h"
#include "stderr.h"
#include "timespec_io.h"
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>
#ifdef USE_SETITIMER
#include "timeval_math.h"
#include "timeval_io.h"
#include <sys/time.h>
#endif /* USE_SETITIMER */
static const char optstr[] = "hvVc:d:f:";
static const char usestr[] = "[-hvV][-c chunk][-d delay][-f file]";
static const char hlpstr[] =
" -c chunk Maximum time to wait for data in a chunk (default 10)\n"
" -d delay Maximum delay after line read (default: 5)\n"
" -f file Read from file instead of standard input\n"
" -h Print this help message and exit\n"
" -v Verbose mode: print timing information to stderr\n"
" -V Print version information and exit\n"
;
static struct timespec time_delay = { .tv_sec = 5, .tv_nsec = 0 };
static struct timespec time_chunk = { .tv_sec = 10, .tv_nsec = 0 };
static struct timespec time_start;
static bool verbose = false;
static void set_chunk_timeout(void);
static void set_delay_timeout(void);
static void cancel_timeout(void);
static void alarm_handler(int signum);
// Using signal() manages to set SA_RESTART on a Mac.
// This is allowed by standard C and POSIX, sadly.
// signal(SIGALRM, alarm_handler);
#if defined(USE_ALARM)
static void set_chunk_timeout(void)
{
if (verbose)
err_remark("-->> %s()\n", __func__);
alarm(time_chunk.tv_sec);
struct sigaction sa;
sa.sa_handler = alarm_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
sigaction(SIGALRM, &sa, NULL);
if (verbose)
err_remark("<<-- %s()\n", __func__);
}
static void set_delay_timeout(void)
{
if (verbose)
err_remark("-->> %s()\n", __func__);
unsigned time_left = alarm(0);
if (time_left > time_delay.tv_sec)
alarm(time_delay.tv_sec);
else
alarm(time_left);
if (verbose)
err_remark("<<-- %s()\n", __func__);
}
static void cancel_timeout(void)
{
if (verbose)
err_remark("-->> %s()\n", __func__);
alarm(0);
signal(SIGALRM, SIG_IGN);
if (verbose)
err_remark("<<-- %s()\n", __func__);
}
#elif defined(USE_SETITIMER)
static inline struct timeval cvt_timespec_to_timeval(struct timespec ts)
{
return (struct timeval){ .tv_sec = ts.tv_sec, .tv_usec = ts.tv_nsec / 1000 };
}
static void set_chunk_timeout(void)
{
if (verbose)
err_remark("-->> %s()\n", __func__);
struct itimerval tv_new = { { 0, 0 }, { 0, 0 } };
tv_new.it_value = cvt_timespec_to_timeval(time_chunk);
struct itimerval tv_old;
if (setitimer(ITIMER_REAL, &tv_new, &tv_old) != 0)
err_syserr("failed to set interval timer: ");
struct sigaction sa;
sa.sa_handler = alarm_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
sigaction(SIGALRM, &sa, NULL);
if (verbose)
err_remark("<<-- %s()\n", __func__);
}
static void set_delay_timeout(void)
{
if (verbose)
err_remark("-->> %s()\n", __func__);
struct itimerval tv_until;
if (getitimer(ITIMER_REAL, &tv_until) != 0)
err_syserr("failed to set interval timer: ");
struct timeval tv_delay = cvt_timespec_to_timeval(time_delay);
if (verbose)
{
char buff1[32];
fmt_timeval(&tv_delay, 6, buff1, sizeof(buff1));
char buff2[32];
fmt_timeval(&tv_until.it_value, 6, buff2, sizeof(buff2));
err_remark("---- %s(): delay %s, left %s\n", __func__, buff1, buff2);
}
if (cmp_timeval(tv_until.it_value, tv_delay) <= 0)
{
if (verbose)
err_remark("---- %s(): no need for delay timer\n", __func__);
}
else
{
struct itimerval tv_new = { { 0, 0 }, { 0, 0 } };
tv_new.it_value = cvt_timespec_to_timeval(time_delay);
struct itimerval tv_old;
if (setitimer(ITIMER_REAL, &tv_new, &tv_old) != 0)
err_syserr("failed to set interval timer: ");
if (verbose)
err_remark("---- %s(): set delay timer\n", __func__);
}
if (verbose)
err_remark("<<-- %s()\n", __func__);
}
static void cancel_timeout(void)
{
if (verbose)
err_remark("-->> %s()\n", __func__);
struct itimerval tv_new =
{
.it_value = { .tv_sec = 0, .tv_usec = 0 },
.it_interval = { .tv_sec = 0, .tv_usec = 0 },
};
struct itimerval tv_old;
if (setitimer(ITIMER_REAL, &tv_new, &tv_old) != 0)
err_syserr("failed to set interval timer: ");
if (verbose)
err_remark("<<-- %s()\n", __func__);
}
#else /* USE_TIMER_GETTIME */
#include "timespec_math.h"
static timer_t t0 = { 0 };
static void set_chunk_timeout(void)
{
if (verbose)
err_remark("-->> %s()\n", __func__);
struct sigevent ev =
{
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGALRM,
.sigev_value.sival_int = 0,
.sigev_notify_function = 0,
.sigev_notify_attributes = 0,
};
if (timer_create(CLOCK_REALTIME, &ev, &t0) < 0)
err_syserr("failed to create a timer: ");
struct itimerspec it =
{
.it_interval = { .tv_sec = 0, .tv_nsec = 0 },
.it_value = time_chunk,
};
struct itimerspec ot;
if (timer_settime(t0, 0, &it, &ot) != 0)
err_syserr("failed to activate timer: ");
struct sigaction sa;
sa.sa_handler = alarm_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
sigaction(SIGALRM, &sa, NULL);
if (verbose)
err_remark("<<-- %s()\n", __func__);
}
static void set_delay_timeout(void)
{
if (verbose)
err_remark("-->> %s()\n", __func__);
struct itimerspec time_until;
if (timer_gettime(t0, &time_until) != 0)
err_syserr("failed to set per-process timer: ");
char buff1[32];
fmt_timespec(&time_delay, 6, buff1, sizeof(buff1));
char buff2[32];
fmt_timespec(&time_until.it_value, 6, buff2, sizeof(buff2));
err_remark("---- %s(): delay %s, left %s\n", __func__, buff1, buff2);
if (cmp_timespec(time_until.it_value, time_delay) <= 0)
{
if (verbose)
err_remark("---- %s(): no need for delay timer\n", __func__);
}
else
{
struct itimerspec time_new =
{
.it_interval = { .tv_sec = 0, .tv_nsec = 0 },
.it_value = time_delay,
};
struct itimerspec time_old;
if (timer_settime(t0, 0, &time_new, &time_old) != 0)
err_syserr("failed to set per-process timer: ");
if (verbose)
err_remark("---- %s(): set delay timer\n", __func__);
}
if (verbose)
err_remark("<<-- %s()\n", __func__);
}
static void cancel_timeout(void)
{
if (timer_delete(t0) != 0)
err_syserr("failed to delete timer: ");
}
#endif /* Timing mode */
/* Writing to stderr via err_remark() is not officially supported */
static void alarm_handler(int signum)
{
assert(signum == SIGALRM);
if (verbose)
err_remark("---- %s(): signal %d\n", __func__, signum);
}
static void read_chunks(FILE *fp)
{
size_t num_data = 0;
size_t max_data = 0;
struct iovec *data = 0;
size_t buflen = 0;
char *buffer = 0;
ssize_t length;
size_t chunk_len = 0;
clock_gettime(CLOCK_REALTIME, &time_start);
set_chunk_timeout();
while ((length = getline(&buffer, &buflen, fp)) != -1)
{
if (num_data >= max_data)
{
size_t new_size = (num_data * 2) + 2;
void *newspace = realloc(data, new_size * sizeof(data[0]));
if (newspace == 0)
err_syserr("failed to allocate %zu bytes data: ", new_size * sizeof(data[0]));
data = newspace;
max_data = new_size;
}
data[num_data].iov_base = buffer;
data[num_data].iov_len = length;
num_data++;
if (verbose)
err_remark("Received line %zu\n", num_data);
chunk_len += length;
buffer = 0;
buflen = 0;
set_delay_timeout();
}
cancel_timeout();
if (chunk_len > 0)
{
if ((length = writev(STDOUT_FILENO, data, num_data)) < 0)
err_syserr("failed to write %zu bytes to standard output: ", chunk_len);
else if ((size_t)length != chunk_len)
err_error("failed to write %zu bytes to standard output "
"(short write of %zu bytes)\n", chunk_len, (size_t)length);
}
if (verbose)
err_remark("---- %s(): data written (%zu bytes)\n", __func__, length);
for (size_t i = 0; i < num_data; i++)
free(data[i].iov_base);
free(data);
free(buffer);
}
int main(int argc, char **argv)
{
const char *name = "(standard input)";
FILE *fp = stdin;
err_setarg0(argv[0]);
err_setlogopts(ERR_MICRO);
int opt;
while ((opt = getopt(argc, argv, optstr)) != -1)
{
switch (opt)
{
case 'c':
if (scn_timespec(optarg, &time_chunk) != 0)
err_error("Failed to convert '%s' into a time value\n", optarg);
break;
case 'd':
if (scn_timespec(optarg, &time_delay) != 0)
err_error("Failed to convert '%s' into a time value\n", optarg);
break;
case 'f':
if ((fp = fopen(optarg, "r")) == 0)
err_syserr("Failed to open file '%s' for reading: ", optarg);
name = optarg;
break;
case 'h':
err_help(usestr, hlpstr);
/*NOTREACHED*/
case 'v':
verbose = true;
break;
case 'V':
err_version("CHUNKER79", &"#(#)$Revision$ ($Date$)"[4]);
/*NOTREACHED*/
default:
err_usage(usestr);
/*NOTREACHED*/
}
}
if (optind != argc)
err_usage(usestr);
if (verbose)
{
err_remark("chunk: %3lld.%09ld\n", (long long)time_chunk.tv_sec, time_chunk.tv_nsec);
err_remark("delay: %3lld.%09ld\n", (long long)time_delay.tv_sec, time_delay.tv_nsec);
err_remark("file: %s\n", name);
}
read_chunks(fp);
return 0;
}
My SOQ repository also has a script gen-data.sh which makes use of some custom programs to generate a data stream such as this (the seed value is written to standard error, not standard output):
$ gen-data.sh
# Seed: 1313715286
2019-06-03 23:04:16.653: Zunmieoprri Rdviqymcho 5878 2017-03-29 03:59:15 Udransnadioiaeamprirteo
2019-06-03 23:04:18.525: Rndflseoevhgs Etlaevieripeoetrnwkn 9500 2015-12-18 10:49:15 Ebyrcoebeezatiagpleieoefyc
2019-06-03 23:04:20.526: Nrzsuiakrooab Nbvliinfqidbujoops 1974 2020-05-13 08:05:14 Lgithearril
2019-06-03 23:04:21.777: Eeagop Aieneose 6533 2016-11-06 22:51:58 Aoejlwebbssroncmeovtuuueigraa
2019-06-03 23:04:23.876: Izirdoeektau Atesltiybysaclee 4557 2020-09-13 02:24:46 Igrooiaauiwtna
2019-06-03 23:04:26.145: Yhioit Eamrexuabagsaraiw 9703 2014-09-13 07:44:12 Dyiiienglolqopnrbneerltnmsdn
^C
$
When fed into chunker79 with default options, I get output like:
$ gen-data.sh | chunker79
# Seed: 722907235
2019-06-03 23:06:20.570: Aluaezkgiebeewal Oyvahee 1022 2015-08-12 07:45:54 Weuababeeduklleym
2019-06-03 23:06:24.100: Gmujvoyevihvoilc Negeiiuvleem 8196 2015-08-29 21:15:15 Nztkrvsadeoeagjgoyotvertavedi
$
If you analyze the time intervals (look at the first two fields in the output lines), that output meets the specification. A still more detailed analysis is shown by:
$ timecmd -mr -- gen-data.sh | timecmd -mr -- chunker79
2019-06-03 23:09:14.246 [PID 57159] gen-data.sh
2019-06-03 23:09:14.246 [PID 57160] chunker79
# Seed: -1077610201
2019-06-03 23:09:14.269: Woreio Rdtpimvoscttbyhxim 7893 2017-03-12 12:46:57 Uywaietirkekes
2019-06-03 23:09:16.939: Uigaba Nzoxdeuisofai 3630 2017-11-16 09:28:59 Jnsncgoesycsevdscugoathusaoq
2019-06-03 23:09:17.845: Sscreua Aloaoonnsuur 5163 2016-08-13 19:47:15 Injhsiifqovbnyeooiimitaaoir
2019-06-03 23:09:19.272 [PID 57160; status 0x0000] - 5.026s - chunker79
2019-06-03 23:09:22.084 [PID 57159; status 0x8D00] - 7.838s - gen-data.sh
$
There is a noticeable pause in this setup between when the output from chunker79 appears and when gen-data.sh completes. That's due to Bash waiting on all processes in the pipeline to complete, and gen-data.sh doesn't complete until the next time it writes to the pipe after the message that finishes chunker79. This is an artefact of this test setup; it wouldn't be a factor in the shell script outlined in the question.
I would consider writing a safe multi-threaded program with queues.
I know Java better, but there might be more suitable modern languages such as Go or Kotlin.
Something like this:
#!/usr/bin/perl
$timeout = 3;
while(<STDIN>) {
    # Make sure there is some input
    push @out, $_;
    eval {
        local $SIG{ALRM} = sub { die };
        alarm $timeout;
        while(<STDIN>) {
            alarm $timeout;
            push @out, $_;
        }
        alarm 0;
    };
    system "echo", "process", @out;
}
GNU Parallel 20200122 introduced --blocktimeout (--bt):
find ~ | parallel -j3 --bt 2s --pipe wc
This works like normal GNU Parallel except if it takes > 2 seconds to fill a block. In that case the block read so far is simply passed to wc (unless it is empty).
It has a slightly odd startup behaviour: You have to wait 3*2s (jobslots*timeout) before the output stabilizes, and you get an output at least every 2s.

Copy structure with included user pointers from user space to kernel space (copy_from_user)

I want to transfer a transaction structure, which contains a user space pointer to an array, to the kernel by using copy_from_user.
The goal is, to get access to the array elements in kernel space.
User space side:
I allocate an array of _sg_param structures in user space. Now I put the address of this array into a transaction structure (line (*)).
Then I transfer the transaction structure to the kernel via ioctl().
Kernel space side:
On executing this ioctl, the complete transaction structure is copied to kernel space (line (**)). Now kernel space is allocated for holding the array (line (***)). Then I try to copy the array from user space to the newly allocated kernel space (line (****)), and here my problems start:
The kernel is corrupted during execution of this copy. dmesg shows the following output:
[ 54.443106] Unhandled fault: page domain fault (0x01b) at 0xb6f09738
[ 54.448067] pgd = ee5ec000
[ 54.449465] [b6f09738] *pgd=2e9d7831, *pte=2d56875f, *ppte=2d568c7f
[ 54.454411] Internal error: : 1b [#1] PREEMPT SMP ARM
Any ideas ???
Following is a simplified extract of my code:
// structure declaration
typedef struct _sg_param {
    void *seg_buf;
    int seg_len;
    int received;
} sg_param_t;

struct transaction {
    ...
    int num_of_elements;
    sg_param_t *pbuf_list;    // Array of sg_param structure
    ...
} trans;

// user space side:
if ((pParam = (sg_param_t *) malloc(NR_OF_STRUCTS * sizeof(sg_param_t))) == NULL) {
    return -ENOMEM;
}
else {
    trans.num_of_elements = NR_OF_STRUCTS;
    trans.pbuf_list = pParam;    // (*)
}
rc = ioctl(dev->fd, MY_CMD, &trans);
if (rc < 0) {
    return rc;
}

// kernel space side
static long ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    arg_ptr = (void __user *)arg;
    // Perform the specified command
    switch (cmd) {
    case MY_CMD:
    {
        struct transaction *__user user_trans;
        user_trans = (struct transaction *__user)arg_ptr;
        if (copy_from_user(&trans, arg_ptr, sizeof(trans)) != 0) {    // (**)
            k_err("Unable to copy transfer info from userspace for "
                  "AXIDMA_DMA_START_DMA.\n");
            return -EFAULT;
        }
        int size = trans.num_of_elements * sizeof(sg_param_t);
        if (trans.pbuf_list != NULL) {
            // Allocate kernel memory for buf_list
            trans.pbuf_list = (sg_param_t *) kmalloc(size, GFP_KERNEL);    // (***)
            if (trans.pbuf_list == NULL) {
                k_err("Unable to allocate array for buffers.\n");
                return -ENOMEM;
            }
            // Now copy pbuf_list from user space to kernel space
            if (copy_from_user(trans.pbuf_list, user_trans->pbuf_list, size) != 0) {    // (****)
                kfree(trans.pbuf_list);
                return -EFAULT;
            }
        }
        break;
    }
    }
}
You're directly accessing userspace data (user_trans->pbuf_list). You should use the one that you've already copied to kernel (trans.pbuf_list).
Code for this would normally be something like:
sg_param_t *local_copy = kmalloc(size, GFP_KERNEL);
if (local_copy == NULL)
    return -ENOMEM;
if (copy_from_user(local_copy, trans.pbuf_list, size) != 0) {
    kfree(local_copy);
    return -EFAULT;
}
trans.pbuf_list = local_copy;
// use trans.pbuf_list
Note that you also need to check that trans.num_of_elements is valid (0 would make kmalloc return ZERO_SIZE_PTR, and too big a value might be a way to DoS the kernel).
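A minimal sketch of that validation, assuming a driver-defined limit (MAX_SG_ENTRIES is a made-up name used only for illustration), could look like this:

    /* Illustrative only: reject zero and implausibly large counts before kmalloc(). */
    if (trans.num_of_elements <= 0 || trans.num_of_elements > MAX_SG_ENTRIES)
        return -EINVAL;
    size = trans.num_of_elements * sizeof(sg_param_t);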

On PowerPC P2040/e500mc, the LD instruction EA causes a kernel panic during PCI card pull-out

Everything I have read so far points to the fact that accessing PCI address space during card pull-out will cause a kernel panic if it is not handled in the kernel machine_check_handler. The machine_check_handler for e500mc looks for the EA (effective address) of the instruction in the MCSRR0 register and compares it against the PCI address space. However, since this address (EA) was not in PCI address space, it eventually caused the kernel panic, as it could not be handled in the machine check interrupt handler: the address the CPU stored in MCSRR0 was some bad address.
The GPRs all point to PCI address space BAR addresses from previous CPU instructions, but the effective address stored in the MCSRR0 register is the same invalid physical address that the NIP is pointing to...
MCSRR1 holds the machine state (MSR) at the point of the interrupt and shows the LD|GLD bits set along with the MCSRR1[RI] bit, so it is a recoverable synchronous interrupt.
And since the CPU access was to an external hot-plugged device, we need not crash the system even if the device is no longer present; hence the kernel check and the safe return from the interrupt.
I have a few questions regarding this issue:
Which GPRs are used to determine the effective address of the LD instruction? The LD bit is set in the MCSR register. How do I tell which addressing mode was used for generating the effective address of the LD instruction?
The LD instruction uses rD, rA, rB operands; how do I find which EA calculation mode is being used by the processor? Apparently there are four of them. Also, which GPRs do each of these operands refer to? I couldn't figure it out from the E500MCRM or the PowerPC EREF.
Since we are writing to PCI address space from user space, the PCI device registers are mapped to some virtual address space in the process memory, to which we are writing. This is a non-cached mapping as far as I know.
Does the CPU address translation to the PCI device physical address, for accessing the PCI device, result in a bad address because the PCI device is no longer connected? My assumption was that, since the device is no longer present, the effective address returned was some junk value that caused this kernel panic. I am not sure if that's how the CPU works.
Any suggestions helping my understanding are welcome; this is way deep down and beyond my expertise. I have gone through the E500MCRM, P2040RM and PowerPC EREF, but I cannot figure out why I am getting a bad address instead of a PCI physical address in the effective address.
kernel - crash dump
fujitsu:~$ fsl_pci_mcheck_exception-> SPRN_MCAR: 0x0
fsl_pci_mcheck_exception-> SPRN_MCSRR0: 0x0f6fec68
fsl_pci_mcheck_exception-> SPRN_MCSRR1: 0x2d002
fsl_pci_mcheck_exception-> SPRN_MCAR: 0x0
fsl_pci_mcheck_exception-> SPRN_DEAR: 0x0
fsl_pci_mcheck_exception-> current->pid: [8333]
fsl_pci_mcheck_exception-> after __get_user_inatomic(inst, &regs->nip): 0x0f6fec68(inst), 0x0f6fec68(regs->nip), 0x0(ret)
Machine check in kernel mode.
Caused by (from MCSR=a000): Load Error Report
Guarded Load Error Report
Oops: Machine check, sig: 7 [#1]
PREEMPT SMP NR_CPUS=4 P2041 RDB
Modules linked in: i2cBridge(O) interruptDriver_pb(O) cma_alloc(O) hwtp_drv(O) interruptDriver_wdt(O)
NIP: 0f6fec68 LR: 0f6fec4c CTR: 0f6faad4
REGS: e4ec5f10 TRAP: 0204 Tainted: G O (3.8.13-rt9+)
MSR: 0002d002 <CE,EE,PR,ME> CR: 40044442 XER: 20000000
TASK = e57dc020[8333] 'RxManager' THREAD: e4ec4000 CPU: 3
GPR00: 0f6fec4c 52afea90 52b06910 50400000 52afeb50 00000003 a0105210 52afebfc
GPR08: a1ffffff a0000000 0000000c a0000000 20044448 1032e800 52900000 00000006
GPR16: 0f74f434 0f729d20 135a78a0 00200000 0fe28280 52aff4b0 00000000 0fe2a6c8
GPR24: 52afec98 0f6cd268 135a7630 00105210 52afebfc 50400000 0f71d31c 00000003
NIP [0f6fec68] 0xf6fec68
LR [0f6fec4c] 0xf6fec4c
Call Trace:
---[ end trace 2715d0da39427f69 ]---
Here's the code from fsl_pci.c that gets called from the machine_check_handler:
#ifdef CONFIG_E500
static int mcheck_handle_load(struct pt_regs *regs, u32 inst)
{
unsigned int rd, ra, rb, d;
rd = get_rt(inst);
ra = get_ra(inst);
rb = get_rb(inst);
d = get_d(inst);
printk(KERN_EMERG "%s==> rd==0x%x, ra=0x%x, rb=0x%x, d=0x%x\n", __FUNCTION__, rd, ra, rb, d);
printk(KERN_EMERG "%s==> get_op(inst) = 0x%x\n", __FUNCTION__, get_op(inst));
return 1;
switch (get_op(inst)) {
case 31:
switch (get_xop(inst)) {
case OP_31_XOP_LWZX:
case OP_31_XOP_LWBRX:
regs->gpr[rd] = 0xffffffff;
break;
case OP_31_XOP_LWZUX:
regs->gpr[rd] = 0xffffffff;
regs->gpr[ra] += regs->gpr[rb];
break;
case OP_31_XOP_LBZX:
regs->gpr[rd] = 0xff;
break;
case OP_31_XOP_LBZUX:
regs->gpr[rd] = 0xff;
regs->gpr[ra] += regs->gpr[rb];
break;
case OP_31_XOP_LHZX:
case OP_31_XOP_LHBRX:
regs->gpr[rd] = 0xffff;
break;
case OP_31_XOP_LHZUX:
regs->gpr[rd] = 0xffff;
regs->gpr[ra] += regs->gpr[rb];
break;
default:
return 0;
}
break;
case OP_LWZ:
regs->gpr[rd] = 0xffffffff;
break;
case OP_LWZU:
regs->gpr[rd] = 0xffffffff;
regs->gpr[ra] += (s16)d;
break;
case OP_LBZ:
regs->gpr[rd] = 0xff;
break;
case OP_LBZU:
regs->gpr[rd] = 0xff;
regs->gpr[ra] += (s16)d;
break;
case OP_LHZ:
regs->gpr[rd] = 0xffff;
break;
case OP_LHZU:
regs->gpr[rd] = 0xffff;
regs->gpr[ra] += (s16)d;
break;
default:
return 0;
}
return 1;
}
static int is_in_pci_mem_space(phys_addr_t addr)
{
struct pci_controller *hose;
struct resource *res;
int i;
list_for_each_entry(hose, &hose_list, list_node) {
if (!(hose->indirect_type & PPC_INDIRECT_TYPE_EXT_REG))
continue;
for (i = 0; i < 3; i++) {
res = &hose->mem_resources[i];
if ((res->flags & IORESOURCE_MEM) &&
addr >= res->start && addr <= res->end)
printk(KERN_EMERG "%s ==> returning from checking addresses\n", __FUNCTION__);
return 1;
}
}
printk(KERN_EMERG "%s ==> returning without checking addresses\n", __FUNCTION__);
return 1;
}
int fsl_pci_mcheck_exception(struct pt_regs *regs)
{
u32 inst;
int ret;
phys_addr_t addr = 0;
/* Let KVM/QEMU deal with the exception */
if (regs->msr & MSR_GS)
return 0;
#ifdef CONFIG_PHYS_64BIT
addr = mfspr(SPRN_MCARU);
addr <<= 32;
#endif
addr += mfspr(SPRN_MCSRR0);
printk(KERN_EMERG "%s-> SPRN_MCAR: 0x%x\n", __FUNCTION__, addr);
printk(KERN_EMERG "%s-> SPRN_MCSRR0: 0x%x\n", __FUNCTION__, mfspr(SPRN_MCSRR0));
printk(KERN_EMERG "%s-> SPRN_MCSRR1: 0x%x\n", __FUNCTION__, mfspr(SPRN_MCSRR1));
printk(KERN_EMERG "%s-> current->pid: 0x%x\n", __FUNCTION__, current->pid);
#ifdef CONFIG_E500
if (mfspr(SPRN_EPCR) & SPRN_EPCR_ICM)
addr = PFN_PHYS(vmalloc_to_pfn((void *)mfspr(SPRN_DEAR)));
printk(KERN_EMERG "%s-> SPRN_DEAR: 0x%x\n", __FUNCTION__, addr);
#endif
printk(KERN_EMERG "%s-> before get_user: 0x%x, 0x%x\n", __FUNCTION__, regs->nip, inst);
if (is_in_pci_mem_space(addr)) {
if (user_mode(regs)) {
pagefault_disable();
/* I am using __get_user_inatomic to get the instruction from the user
space as any other get_user versions were resulting in -EFAULT as they can
sleep and this needs to be called from user context and we are in interrupt
context.
*/
ret = __get_user_inatomic(inst, &regs->nip);
pagefault_enable();
} else {
ret = probe_kernel_address(regs->nip, inst);
}
printk(KERN_EMERG "%s-> after get_user: 0x%x, 0x%x, 0x%d\n", __FUNCTION__, regs->nip, inst, ret);
if (mcheck_handle_load(regs, inst)) {
regs->nip += 4;
printk(KERN_EMERG "%s-> after mcheck_handle load: 0x%x, 0x%x\n", __FUNCTION__, regs->nip, inst);
return 1;
}
}
return 0;
}
#endif
Here's the code I added to fix the kernel panic. It looks like regs->gpr[0] is the destination of the LD instruction, and incrementing the instruction pointer took care of returning from the interrupt context cleanly. I still have the issue of verifying that this interrupt originated from an access to a PCI device address. Right now I have commented out the PCI address range check, and without this check I am able to access any address without crashing the system, which is even worse.
Yes, even a null pointer access doesn't crash the system anymore. I tried it with devmem2: I accessed a null pointer and the call goes through the interrupt and returns safely after dumping the logs from the interrupt handler.
regs->gpr[0] = 0xffffffff;
regs->nip += 4;
return 1;
if (mcheck_handle_load(regs, inst)) {

Why pam_loginuid module fails on writing to /proc/self/loginuid with -EPERM?

I found that an application using the PAM library to authenticate fails with the error:
Error writing /proc/self/loginuid: Operation not permitted
With strace I found that the failure is on the write to the /proc/self/loginuid file.
Further inspection and adding some debug code to the kernel (code below):
static ssize_t proc_loginuid_write(struct file *file, const char __user *buf,
                                   size_t count, loff_t *ppos)
{
    struct inode *inode = file_inode(file);
    uid_t loginuid;
    kuid_t kloginuid;
    int rv;

    printk(KERN_DEBUG "proc_loginuid_write\n");
    printk(KERN_DEBUG "a+++ %s\n", current->comm);
    printk(KERN_DEBUG "b+++ %s\n", pid_task(proc_pid(inode), PIDTYPE_PID)->comm);
    printk(KERN_DEBUG "+++2++ pid = %d\n", current->pid);
    printk(KERN_DEBUG "+++3++ pid = %d\n", pid_task(proc_pid(inode), PIDTYPE_PID)->pid);

    rcu_read_lock();
    if (current != pid_task(proc_pid(inode), PIDTYPE_PID)) {
        rcu_read_unlock();
        printk(KERN_ERR "proc_loginuid_write failed by permission!\n");
        return -EPERM;
    }
    rcu_read_unlock();

    if (*ppos != 0) {
        /* No partial writes. */
        return -EINVAL;
    }

    rv = kstrtou32_from_user(buf, count, 10, &loginuid);
    if (rv < 0)
        return rv;

    /* is userspace tring to explicitly UNSET the loginuid? */
    if (loginuid == AUDIT_UID_UNSET) {
        kloginuid = INVALID_UID;
    } else {
        kloginuid = make_kuid(file->f_cred->user_ns, loginuid);
        if (!uid_valid(kloginuid))
            return -EINVAL;
    }

    rv = audit_set_loginuid(kloginuid);
    if (rv < 0)
        return rv;
    return count;
}
showed in dmesg that:
[ 30.672242] proc_loginuid_write
[ 30.672249] a+++ testapp
[ 30.672251] b+++ testapp
[ 30.672254] +++2++ pid = 2920
[ 30.672257] +++3++ pid = 2451
[ 30.672259] proc_loginuid_write failed by permission!
The name testapp is intentionally changed. So it looks like the file /proc/self/loginuid is created by the parent, and it is read by a child thread.
I tested the same code on kernels 3.14 and 4.9; on kernel 3.14 it works and on kernel 4.9 it doesn't. Why?
I found the solution to the problem.
The old 3.14 kernel had the CONFIG_AUDITSYSCALL option turned off in its config, so there was no /proc/self/loginuid file and the PAM module simply does not care when there is no such file.
On the newer 4.9 kernel the option is automatically selected by CONFIG_AUDIT=y.
So the simplest solution is to turn off the CONFIG_AUDIT option; why CONFIG_AUDITSYSCALL became a non-controllable option in the course of kernel evolution is a matter for another question.
Thanks!

Difference between mutual exclusion and blocked-IO in kernel programming?

I am unable to understand the difference between the following two pieces of code. Can anybody explain the difference between them, and also explain the difference between a semaphore and a mutex with an example?
Mutual exclusion:
DEFINE_SEMAPHORE(mysem);

static ssize_t dev_read(struct file *file, char *buf, size_t lbuf, loff_t *ppos)
{
    int maxbytes, bytes_to_do, nbytes;
    maxbytes = SIZE - *ppos;
    if (maxbytes < lbuf)
        bytes_to_do = maxbytes;
    else
        bytes_to_do = lbuf;
    if (bytes_to_do == 0) {
        printk("reached end of device\n");
        return -ENOSPC;
    }
    if (down_interruptible(&mysem))
        return -ERESTARTSYS;
    nbytes = bytes_to_do - copy_to_user(buf, dev_buf + *ppos, bytes_to_do);
    up(&mysem);
    *ppos += nbytes;
    return nbytes;
}

static ssize_t dev_write(struct file *file, const char *buf, size_t lbuf, loff_t *ppos)
{
    int maxbytes, bytes_to_do, nbytes;
    maxbytes = SIZE - *ppos;
    if (maxbytes < lbuf)
        bytes_to_do = maxbytes;
    else
        bytes_to_do = lbuf;
    if (bytes_to_do == 0) {
        printk("reached end of device\n");
        return -ENOSPC;
    }
    if (down_interruptible(&mysem))
        return -ERESTARTSYS;
    nbytes = bytes_to_do - copy_from_user(dev_buf + *ppos, buf, bytes_to_do);
    ssleep(10);
    up(&mysem);
    *ppos += nbytes;
    return nbytes;
}
Blocked IO
init_MUTEX_LOCKED(&mysem);

static ssize_t dev_read(struct file *file, char *buf, size_t lbuf, loff_t *ppos)
{
    int maxbytes, bytes_to_do, nbytes;
    maxbytes = SIZE - *ppos;
    if (maxbytes < lbuf)
        bytes_to_do = maxbytes;
    else
        bytes_to_do = lbuf;
    if (bytes_to_do == 0) {
        printk("reached end of device\n");
        return -ENOSPC;
    }
    if (down_interruptible(&mysem))
        return -ERESTARTSYS;
    nbytes = bytes_to_do - copy_to_user(buf, dev_buf + *ppos, bytes_to_do);
    *ppos += nbytes;
    return nbytes;
}

static ssize_t dev_write(struct file *file, const char *buf, size_t lbuf, loff_t *ppos)
{
    int maxbytes, bytes_to_do, nbytes;
    maxbytes = SIZE - *ppos;
    if (maxbytes < lbuf)
        bytes_to_do = maxbytes;
    else
        bytes_to_do = lbuf;
    if (bytes_to_do == 0) {
        printk("reached end of device\n");
        return -ENOSPC;
    }
    nbytes = bytes_to_do - copy_from_user(dev_buf + *ppos, buf, bytes_to_do);
    ssleep(10);
    up(&mysem);
    *ppos += nbytes;
    return nbytes;
}
A mutex is nothing but a binary semaphore: it can have only two states, locked and unlocked. A counting semaphore can have a larger count, so the number of processes that can acquire the semaphore at the same time equals the count with which the semaphore is initialized.
In your example, in the first code snippet, whichever side acquires the lock (read or write) also releases it itself after it completes its respective read or write. The two cannot run simultaneously because of the mutual exclusion.
The second code snippet exhibits the blocking I/O concept, which is designed to solve a problem explained in the book Linux Device Drivers (LDD): "what to do when there's no data yet to read, but we're not at end-of-file. The default answer is go to sleep waiting for data". As you can see in the code, the lock is declared as a mutex, and in the locked state at that. So if a read comes when there is no data, it cannot acquire the lock because the mutex is already locked, and it goes to sleep (in short, the read is blocked). Whenever a write comes, it first writes to the device and then releases the mutex. Now the blocked read can acquire the lock and complete its read. Here too, the two cannot run simultaneously, but lock acquisition and release are synchronized in such a way that a read cannot make progress until a write has written something to the device.
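To illustrate the count difference, here is a small sketch (illustrative only) using sema_init(); the older init_MUTEX_LOCKED(&mysem) used in the second snippet is equivalent to sema_init(&mysem, 0), i.e. the semaphore starts out locked:

    struct semaphore my_counting_sem;
    struct semaphore my_binary_sem;

    sema_init(&my_counting_sem, 3);   /* counting: up to 3 tasks may hold it concurrently */
    sema_init(&my_binary_sem, 1);     /* binary: behaves like a mutex, one holder at a time */
    sema_init(&mysem, 0);             /* starts locked: the first down_interruptible() blocks
                                       * until some other path calls up(&mysem) */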
