Kernel Crashes due to OOM error (USB_SUBMIT_URB) - linux-kernel

Scenario:
I am calling usb_submit_urb in an ioctl call to send audio packets from the application.
The implementation is as follows:
retval = copy_from_user(&pkt_1722, pkt_1722_userspace,
                        sizeof(struct ifr_data_struct_1722));
if (retval) {
    printk("copy_from_user error: pkt_1722\n");
    return -EFAULT;  /* bail out instead of falling through */
}

usb_init_urb(bky_urb_alloc[bky_urb_context.bky_i]);
usb_fill_bulk_urb(bky_urb_alloc[bky_urb_context.bky_i],
                  dev->udev,
                  priv->avb_a_out,
                  (void *)dma_buf[bky_urb_context.bky_i],
                  112,
                  bky_write_bulk_callback,
                  &bky_urb_context);

retval = usb_submit_urb(bky_urb_alloc[bky_urb_context.bky_i], GFP_ATOMIC);
if (retval) {
    printk(KERN_INFO "%s - failed submitting write urb, error %d\n",
           __func__, retval);
    goto error;
}
I maintain an array of URBs so that I can reuse them after their completion handlers run. The URBs and dma_buf are allocated once, in probe.
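The reuse bookkeeping looks roughly like the following sketch; the per-slot busy flag, NUM_URBS, and the slot-index context are illustrative assumptions, not the actual code:

#define NUM_URBS 8                       /* illustrative pool size */
static atomic_t bky_urb_busy[NUM_URBS];  /* hypothetical per-slot flag */

/* completion handler: mark the slot reusable; do not free the urb here */
static void bky_write_bulk_callback(struct urb *urb)
{
    int slot = *(int *)urb->context;     /* assume the context identifies the slot */

    if (urb->status)
        pr_err("bulk write urb failed: %d\n", urb->status);
    atomic_set(&bky_urb_busy[slot], 0);
}

/* submit path: refuse slots whose urb is still in flight */
if (atomic_xchg(&bky_urb_busy[slot], 1))
    return -EBUSY;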
Problem:
I can stream 1722 packets for a few hours, after which the kernel crashes: all I see is a black screen with call traces reporting an OOM (out of memory) error. The PID blamed for the error belongs to some other kernel process running in the background; it tries to allocate pages, fails, and the kernel crashes with OOM.
Maybe this problem is due to external fragmentation building up over time.
Any inputs will be of great help.

1) Are the USB URBs being consumed by something? Whoever holds the pointer to the block is responsible for passing it on or freeing the buffer.
2) Have you set vm.min_free_kbytes in /etc/sysctl.conf to at least 1% of system memory? (Example after this list.)
3) While the system runs, capture /proc/slabinfo in a shell loop (also sketched below) and see if there is a leak somewhere.
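For point 2, the entry in /etc/sysctl.conf looks like this (65536 is only an example figure; size it to roughly 1% of your RAM, then load it with sysctl -p):

vm.min_free_kbytes = 65536

And a crude capture loop for point 3 (file name and interval are arbitrary):

while true; do date >> slab.log; cat /proc/slabinfo >> slab.log; sleep 10; done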

Related

What causes WriteFile to return error 38 (ERROR_HANDLE_EOF)?

What would cause WriteFile to return error 38 (ERROR_HANDLE_EOF, "Reached the end of the file")? The "file" in this case is a mailslot. The way my program works is that I have a process (running as a Windows service) that creates multiple child processes. Each child opens a mailslot of the same name in order to send status information back to its parent. In my small-scale testing this works fine, but when I have several processes running (like 16) I am seeing this error. The code below shows how I am opening and writing to the mailslot in the child process.
Is it perhaps because the parent is not reading the mailslot fast enough? Is there a way to increase the capacity of a mailslot so that end-of-file never gets reached? I really don't see how a mailslot can get full anyway, as long as there is disk space (and there is plenty of it).
char gLocalSlotName[256] = "\\\\.\\mailslot\\TMAgentSlot-ComputerName";

gAgentSlot = CreateFile(gLocalSlotName, GENERIC_WRITE, FILE_SHARE_READ,
                        (LPSECURITY_ATTRIBUTES)NULL, OPEN_EXISTING,
                        FILE_ATTRIBUTE_NORMAL, (HANDLE)NULL);

fResult = WriteFile(gAgentSlot, (char *)&ProcStat, sizeof(PROCSTAT),
                    &cbWritten, (LPOVERLAPPED)NULL);
if (!fResult) {
    derr = GetLastError();
    printf("WriteFile error=%d\n", derr);
}
WriteFile is a thin shell over NtWriteFile. If NtWriteFile returns an error NTSTATUS, it is converted to an equivalent Win32 error code (via RtlNtStatusToDosError) and WriteFile returns FALSE. You can get the Win32 error code via GetLastError(); the original NTSTATUS is available via RtlGetLastNtStatus(), exported by ntdll.dll. The problem with Win32 error codes is that several different NTSTATUS values are sometimes converted to the same Win32 error.
In the case of ERROR_HANDLE_EOF, two different NTSTATUS values convert to it: STATUS_END_OF_FILE and STATUS_FILE_FORCED_CLOSED. STATUS_END_OF_FILE looks like it is never returned by msfs.sys (the driver that handles mailslots). On the other hand, STATUS_FILE_FORCED_CLOSED ("The specified file has been closed by another process.") can be returned when you write data to a mailslot (by msfs!MsCommonWrite) if the server end of the mailslot (the end you create via the CreateMailslot call) is already closed.
Formally, when the last server handle is closed, all connected clients are marked as being in a closing state (inside MsFsdCleanup), and if you then call WriteFile on such a client, STATUS_FILE_FORCED_CLOSED is returned.
So:
What causes WriteFile to return error 38 (ERROR_HANDLE_EOF)?
The server process has, for some reason, closed its own mailslot handle. You need to search in this direction: find when and why the mailslot handle is closed in the parent process.
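A minimal sketch of pulling the underlying NTSTATUS out after a failed WriteFile; RtlGetLastNtStatus is exported by ntdll.dll but not declared in the SDK headers, so it is resolved at run time here, and the status value is spelled out as a literal for the same reason:

#include <windows.h>
#include <stdio.h>

typedef LONG (NTAPI *RtlGetLastNtStatus_t)(void);
static RtlGetLastNtStatus_t pRtlGetLastNtStatus;  /* resolve once at startup */

void init_ntstatus_helper(void)
{
    pRtlGetLastNtStatus = (RtlGetLastNtStatus_t)
        GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "RtlGetLastNtStatus");
}

/* call immediately after a failed WriteFile, before any other API call */
void report_write_error(void)
{
    DWORD win32 = GetLastError();
    LONG  nt    = pRtlGetLastNtStatus ? pRtlGetLastNtStatus() : 0;

    printf("WriteFile failed: win32=%lu ntstatus=0x%08lX\n",
           win32, (unsigned long)nt);
    if (nt == (LONG)0xC00000B6)  /* STATUS_FILE_FORCED_CLOSED */
        printf("server end of the mailslot has been closed\n");
}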

linux device driver stuck in spin lock due to ongoing read on first access

I am working on a Linux driver for a USB device which, fortunately, is identical in structure to the usb_skeleton example driver that is part of the standard kernel source.
With the 4.4 kernel it was a breeze: I simply changed the VID and PID and a few strings, and the driver compiled and worked perfectly on both x64 and ARM kernels.
But it turns out I have to make this work with a 3.2 kernel; I have no choice in this. I made the same modifications to the skeleton driver in the 3.2 source. Again, I did not have to change actual code, just the VID, PID and some strings. Although it compiles and loads fine (and shows up in /dev), it permanently hangs on the first attempt to read from /dev/myusbdev0.
The following code is from the read function, which is supposed to read from the bulk endpoint. When I attempt to read the device, I see the first message, saying it is going to block due to ongoing IO. Then nothing. The user program trying to read is hung and cannot be killed with kill -9. The Linux machine cannot even reboot; I have to power cycle it. There are no error messages, exceptions or anything like that. It seems fairly certain it is hanging in the part commented 'IO may take forever'.
My question is: why would there be ongoing IO when no program has done any IO with the driver yet? Can I fix this in driver code, or does the user program have to do something before it can start reading from /dev/myusbdev0?
In this case the target machine is an embedded ARM device similar to a BeagleBone Black. Incidentally, the 4.4 kernel version of this driver works perfectly on the BeagleBone with the same user-mode test program.
/* if IO is under way, we must not touch things */
retry:
    spin_lock_irq(&dev->err_lock);
    ongoing_io = dev->ongoing_read;
    spin_unlock_irq(&dev->err_lock);

    if (ongoing_io) {
        dev_info(&interface->dev,
                 "USB PureView Pulser Receiver device blocking due to ongoing io -%d",
                 interface->minor);
        /* nonblocking IO shall not wait */
        if (file->f_flags & O_NONBLOCK) {
            rv = -EAGAIN;
            goto exit;
        }
        /*
         * IO may take forever
         * hence wait in an interruptible state
         */
        rv = wait_for_completion_interruptible(&dev->bulk_in_completion);
        dev_info(&interface->dev,
                 "USB PureView Pulser Receiver device completion wait done io -%d",
                 interface->minor);
        if (rv < 0)
            goto exit;
        /*
         * by waiting we also semiprocessed the urb
         * we must finish now
         */
        dev->bulk_in_copied = 0;
        dev->processed_urb = 1;
    }
Writing this up as an answer since there was no response to my comments. Kernel commit c79041a4[1], which was added in 3.10, fixes "blocked forever in skel_read". Looking at the code above, I see that the first message can trigger without the second being shown if the device file has the O_NONBLOCK flag set. As described in the commit message, if the completion occurs between read() calls, the next read() call will end up at the uninterruptible wait, waiting for a completion which has already occurred.
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c79041a4
Obviously I am not sure that this is what you are seeing, but I think there is a good chance. If that is correct, then you can apply the change (manually) to your driver, and that should fix the problem.
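For orientation, the pattern the fix moves toward looks roughly like this; field names follow later usb-skeleton sources, so treat it as a sketch rather than the literal commit diff:

/* Sketch only: ongoing_read is cleared and bulk_in_wait (a
 * wait_queue_head_t) is woken in the bulk-in completion handler. */
spin_lock_irq(&dev->err_lock);
ongoing_io = dev->ongoing_read;
spin_unlock_irq(&dev->err_lock);

if (ongoing_io) {
    /* nonblocking IO shall not wait */
    if (file->f_flags & O_NONBLOCK) {
        rv = -EAGAIN;
        goto exit;
    }
    /* wait on a state flag, not a one-shot completion: the condition is
     * re-checked on every wakeup, so a read that finished between read()
     * calls is observed immediately instead of being waited on again */
    rv = wait_event_interruptible(dev->bulk_in_wait, !dev->ongoing_read);
    if (rv < 0)
        goto exit;
}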

Linux multiprocessing in kernel space

I do multiprocessing in user space by creating multiple threads and pinning them to specific CPU cores. They process data in a specific place in memory which is never swapped and is constantly filled with new data by a streaming device driver. I want the best possible speed, so I thought of moving everything into kernel space to get rid of all the memory-pointer conversion and the kernel->user communication and vice versa. The process is started with an ioctl to the driver; then multiple threads are spawned and pinned to specific cores. The process is terminated with another ioctl by the user.
So far I have gathered info about kthreads, which are not so well documented, and I wrote this in the ioctl function:
thread_ch1 = kthread_create(&Channel1_thread, (void *)buffer, "Channel1_thread");
thread_ch2 = kthread_create(&Channel2_thread, (void *)buffer, "Channel2_thread");
thread_ch3 = kthread_create(&Channel3_thread, (void *)buffer, "Channel3_thread");
thread_ch4 = kthread_create(&Channel4_thread, (void *)buffer, "Channel4_thread");
printk(KERN_WARNING "trd1: %p, trd2: %p, trd3: %p, trd4: %p\n",
       thread_ch1, thread_ch2, thread_ch3, thread_ch4);
kthread_bind(thread_ch1, 0);
kthread_bind(thread_ch2, 1);
kthread_bind(thread_ch3, 2);
kthread_bind(thread_ch4, 3);
wake_up_process(thread_ch1);
wake_up_process(thread_ch2);
wake_up_process(thread_ch3);
wake_up_process(thread_ch4);
and the ioctl returns.
Each Channeln_thread is just a for loop:
int Channel1_thread(void *buffer)
{
    uint64_t i;

    for (i = 0; i < 10000000000ULL; i++)
        ;  /* placeholder busy loop */
    return 0;
}
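(As an aside: for the terminating ioctl described above, a kthread body would normally poll kthread_should_stop() so that kthread_stop() can end it. A minimal sketch, with hypothetical names:)

#include <linux/kthread.h>
#include <linux/delay.h>

static int channel_thread(void *buffer)   /* hypothetical stoppable body */
{
    while (!kthread_should_stop()) {
        /* process the shared buffer here */
        usleep_range(1000, 2000);         /* sleep instead of busy-waiting */
    }
    return 0;
}

/* in the terminating ioctl: kthread_stop(thread_ch1); and so on */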
The threads never seem to execute, and thread_ch1-4 hold non-NULL pointers. If I add a small delay before the ioctl returns, I can see the threads running.
Can someone shed some light?
Thanks

Netlink: sending from kernel to user - EAGAIN and ENOBUFS

I'm having a lot of trouble sending netlink messages from a kernel module to a userspace daemon. They randomly fail. On the kernel side, genlmsg_unicast fails with EAGAIN, while on the user side, nl_recvmsgs_default (a function from libnl) fails with NLE_NOMEM, which is caused by the recvmsg syscall failing with ENOBUFS.
The netlink messages are small; the maximum payload size is ~300 bytes.
Here is the code for sending message from kernel:
int send_to_daemon(void *msg, int len, int command, int seq, u32 pid)
{
    struct sk_buff *skb;
    void *msg_head;
    int payload;

    payload = GENL_HDRLEN + nla_total_size(len) + 36;
    skb = genlmsg_new(payload, GFP_KERNEL);
    if (!skb)
        return -ENOMEM;
    msg_head = genlmsg_put(skb, pid, seq, &psvfs_gnl_family, 0, command);
    if (!msg_head) {
        nlmsg_free(skb);
        return -ENOMEM;
    }
    if (nla_put(skb, PSVFS_A_MSG, len, msg)) {
        nlmsg_free(skb);
        return -EMSGSIZE;
    }
    genlmsg_end(skb, msg_head);
    return genlmsg_unicast(&init_net, skb, pid);
}
I absolutely have no idea why this is happening, and my project just won't work because of it! I really hope someone can help me with this.
I wonder if you are running on a 64-bit machine. If so, I suspect that using an int as the type of payload could be the root of some issues, as genlmsg_new() expects a size_t, which is 64 bits on x86_64.
Secondly, I don't think you need to add GENL_HDRLEN to payload, as that is taken care of by genlmsg_new() (it uses genlmsg_total_size(), which returns genlmsg_msg_size(), which finally does the addition). Why the + 36, by the way? It does not look very portable, nor is it explicit about what it is there for. (See the sketch below.)
Hard to tell more without having a look at the rest of the code.
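A hedged sketch of the allocation this suggests, letting genlmsg_new() account for the family and netlink headers itself (PSVFS_A_MSG and psvfs_gnl_family are from the question):

size_t payload = nla_total_size(len);  /* attribute header + padded data only */
struct sk_buff *skb = genlmsg_new(payload, GFP_KERNEL);
if (!skb)
    return -ENOMEM;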
I was having a similar problem receiving ENOBUFS via recvmsg from a netlink socket. I found that my problem was the kernel socket buffer filling before userspace could drain it.
From the netlink(7) man page:
However, reliable transmissions from kernel to user are impossible in any case. The kernel can't send a netlink message if the socket buffer is full: the message will be dropped and the kernel and the user-space process will no longer have the same view of kernel state. It is up to the application to detect when this happens (via the ENOBUFS error returned by recvmsg(2)) and resynchronize.
I addressed this problem by increasing the size of the socket receive buffer (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, ...), or nl_socket_set_buffer_size() if you are using libnl). A sketch follows.
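A minimal sketch of both halves of that remedy on a plain netlink socket; the 1 MiB figure is an arbitrary example, and fd/msgh are assumed to come from the surrounding receive code:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

/* enlarge the receive buffer once, right after creating the socket */
int rcvbuf = 1 << 20;  /* 1 MiB, example value; tune for your message rate */
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
    perror("setsockopt(SO_RCVBUF)");

/* in the receive loop, treat ENOBUFS as "resynchronize", not as fatal */
ssize_t n = recvmsg(fd, &msgh, 0);
if (n < 0 && errno == ENOBUFS) {
    /* messages were dropped; re-request the kernel state to resync */
}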

Win32 API: ReadFile not timing out

I'm writing some code to interface with a piece of hardware. The hardware connects to the PC via USB, with a USB-to-serial converter inside the device (it shows up as a COM port device in Windows).
I'm having issues with the Win32 API ReadFile system call. I can't seem to get it to work as advertised. I've set up the COMMTIMEOUTS structure like so:
COMMTIMEOUTS ct;
ct.ReadIntervalTimeout         = MAXDWORD;
ct.ReadTotalTimeoutConstant    = 0;
ct.ReadTotalTimeoutMultiplier  = 0;
ct.WriteTotalTimeoutConstant   = 0;
ct.WriteTotalTimeoutMultiplier = 0;
if (SetCommTimeouts(device_id_, &ct) == 0)
{
    return ERROR;  /* this is never hit */
}
Which, according to the Win32 API documentation, means:
ReadIntervalTimeout
The maximum time allowed to elapse between the arrival of two bytes on the communications line, in milliseconds. During a ReadFile operation, the time period begins when the first byte is received. If the interval between the arrival of any two bytes exceeds this amount, the ReadFile operation is completed and any buffered data is returned. A value of zero indicates that interval time-outs are not used.
A value of MAXDWORD, combined with zero values for both the ReadTotalTimeoutConstant and ReadTotalTimeoutMultiplier members, specifies that the read operation is to return immediately with the bytes that have already been received, even if no bytes have been received.
The command I'm sending is supposed to return a single-byte integer. Most of the time, the command is received by the device and it returns the appropriate value. Sometimes, however, it doesn't seem to return a value, and ReadFile() blocks until more bytes are received (e.g. by pressing buttons on the device). Once a button is hit, the initial integer response I was expecting is received along with the button-press code. While this isn't the behavior I'm expecting from the device itself, I'm more concerned with ReadFile() blocking when it shouldn't be, according to the MSDN documentation. Is there a remedy for ReadFile() blocking here?
D'oh! Turns out the ReadFile blocking was just a symptom, not the problem. The hardware device in question only has a 4 MHz processor in it. Splitting up the three-character command written to the device and sending the characters individually, with a 1 ms pause between them, fixes the issue.
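A sketch of that workaround (the command bytes are hypothetical stand-ins; device_id_ is the handle from the code above):

unsigned char cmd[3] = { 0x01, 0x02, 0x03 };  /* hypothetical command bytes */
DWORD written = 0;

for (int i = 0; i < 3; ++i) {
    if (!WriteFile(device_id_, &cmd[i], 1, &written, NULL))
        break;   /* handle the error in real code */
    Sleep(1);    /* ~1 ms gap so the 4 MHz MCU can keep up */
}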
