kernel paging request fails in try_module_get() - linux-kernel

The following code fails in one of my Linux kernel modules:
printk("This module: %p\n",THIS_MODULE);
DEBUG_USE_COUNT(p);
printk("This module refcount: %d\n", module_refcount(THIS_MODULE));
DEBUG_USE_COUNT(p);
if (!try_module_get(THIS_MODULE)) {
	printk_stderr("can't get module\n");
	return -EFAULT;
}
The code works in the usual environment, but when I execute it in a function called from within another module, it fails with a paging error. (The other module is passed a pointer to the function in question during initialization.)
Any ideas why a module can't increment its own reference count when called from another module?
Are there any special limitations that apply to the try_module_get() call?
[ 7888.065029] BUG: unable to handle kernel paging request at fa692068
[ 7888.067470] IP: [<f926a2b6>] _ZL18open_station_sharePKcP23__camac_kernel_open_argP4file+0x84/0x8ec [camac_k0607_lsi6] //function in question, calling try_module_get()
[ 7888.069014] Call Trace:
[ 7888.069014] [<c10ac2b7>] ? __kmalloc+0x104/0x110
[ 7888.069014] [<c12518f5>] ? printk+0xe/0x11
[ 7888.069014] [<f90fae79>] ? T.633+0x46/0x4b [camac_mx]
[ 7888.069014] [<f90fb07e>] ? camac_mx_ioctl+0x200/0x228 [camac_mx] //function of another module that calls the one in question
[ 7888.069014] [<c10ba415>] ? vfs_ioctl+0x58/0x72
[ 7888.069014] [<c10ba966>] ? do_vfs_ioctl+0x492/0x4d6
[ 7888.069014] [<c109007b>] ? shmem_parse_options+0x167/0x281
[ 7888.069014] [<c10ae69e>] ? fd_install+0x1b/0x38
[ 7888.069014] [<c10ae88b>] ? do_sys_open+0xc8/0xdd
[ 7888.069014] [<c10ba9ee>] ? sys_ioctl+0x44/0x64
[ 7888.069014] [<c100305b>] ? sysenter_do_call+0x12/0x28
It would also be great if someone could explain the garbage at the top of the printed stack. There should not be any other functions in a cross-module call. The top three functions on the stack are meaningless to me.

THIS_MODULE may evaluate to NULL if the particular source file is compiled into the kernel rather than as part of a module, and module_refcount does not like getting NULL.
Also, the use of C++ in kernel modules is not recommended, as it may interfere with just about everything (think of exceptions and the like).

Related

raw_spin_lock(): unexpected null pointer exception

So I adapted some of the /kernel/sched/rt.c code to write my own simple CPU scheduler, and I'm getting a null pointer dereference exception when I try to acquire a lock. This is despite me printk()'ing all of the relevant pointers, and seeing that they're not NULL.
//Snippet from my adaptation of update_curr_rt()
//wrr_rq is a struct wrr_rq*
printk("Before loop, wrr_rq pointer is %p\n",wrr_rq);
printk("Before loop, &wrr_rq->wrr_runtime_lock is %p\n",&wrr_rq->wrr_runtime_lock);
for_each_sched_wrr_entity(wrr_se) {
printk("1\n");
wrr_rq = wrr_rq_of_se(wrr_se);
printk("2\n");
raw_spin_lock(&wrr_rq->wrr_runtime_lock);
printk("3\n");
[ 263.595176] Before loop, wrr_rq is 00000000aebb4d6d
[ 263.596283] Before loop, &wrr_rq->wrr_runtime_lock is 0000000015dee87f
[ 263.597764] 1
[ 263.598141] wrr_rq_of_se: called
[ 263.598888] 2
[ 263.599268] BUG: kernel NULL pointer dereference, address: 0000000000000068
[ 263.600836] #PF: supervisor write access in kernel mode
[ 263.602027] #PF: error_code(0x0002) - not-present page
...
[ 263.656134] RIP: 0010:_raw_spin_lock+0x7/0x20
I've printed all the relevant pointers and seen they're not NULL (and have values quite a bit above 0), but I still get this exception. I tried using the elixir browser to see what is happening with the raw_spin_lock() macro, and it doesn't seem like anything crazy is happening...
In addition, the runqueue lock is already held when this code is called (the runqueue lock is acquired by task_sched_runtime()).
Any thoughts appreciated.
Thanks.
Credit to @0andriy: it turns out that pointers printed with %p are hashed, and a NULL pointer can hash to some other value that is not NULL. When I printed things with %px instead, I saw they were in fact NULL.
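A side note on reading the oops itself: the faulting address 0000000000000068 is small but not zero, which is what a member access through a NULL base pointer usually looks like (NULL plus the member's offset). Here is a minimal user-space sketch of that arithmetic. The struct layout is hypothetical, since the real struct wrr_rq is not shown in the question; the padding is chosen only so that the lock lands at offset 0x68.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: the real struct wrr_rq layout is not in the question. */
struct fake_lock {
	int raw;
};

struct fake_wrr_rq {
	char other_fields[0x68];            /* whatever precedes the lock */
	struct fake_lock wrr_runtime_lock;  /* assumed to sit at offset 0x68 */
};

/* Compile-time check that the assumed layout really puts the lock at 0x68. */
static_assert(offsetof(struct fake_wrr_rq, wrr_runtime_lock) == 0x68,
	      "lock is expected at offset 0x68 in this sketch");

/*
 * Address the CPU would touch for base->wrr_runtime_lock, computed with
 * plain integer arithmetic so nothing is actually dereferenced.
 */
static uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct fake_wrr_rq, wrr_runtime_lock);
}
```

With a NULL base, member_address(0) gives 0x68, matching the address in the BUG line. So a small nonzero fault address is a hint to suspect the base pointer even when a %p print of it looks like a valid value.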

Unable to Call Function in Go debugger

I am following the "Little Go Book" by Karl Seguin, in order to learn Go.
My working environment is Visual Studio Code.
While debugging, when I try to call a function from the debug console, I get the error "function calls not allowed without using 'call'". If I then try "call fib(10)", I get "Unable to eval expression: "1:6: expected 'EOF', found fib"".
This is the function I am trying to evaluate:
// Fibonacci
func fib(n int) int64 {
	if n == 0 {
		return 0
	} else if n == 1 {
		return 1
	} else {
		return fib(n-1) + fib(n-2)
	}
}
If I call the function from the code itself (from main(), for instance), it works perfectly.
However, if I set a breakpoint and try to call the same function from the debugger console, I get the following errors:
Eval error: function calls not allowed without using 'call'
call fib(10)
Unable to eval expression: "1:6: expected 'EOF', found fib"
Failed to eval expression: {
"Expr": "call fib(10)",
"Scope": {
"goroutineID": 1,
"frame": 0
},
"Cfg": {
"followPointers": true,
"maxVariableRecurse": 1,
"maxStringLen": 64,
"maxArrayValues": 64,
"maxStructFields": -1
}
}
Looks like this is not supported yet: see the GitHub issue "Function calls via delve 'call' are not supported" in the microsoft/vscode-go repo :(
vscode-go issue 100, "debug: support function calls via delve 'call'", has just been closed by PR 101 and commit 5a7752c / CL 249377.
Delve supports function calls. Even though it is still experimental and can be applied only to a limited set of functions, this is a useful feature, many vscode-go users long for.
Unlike other javascript/typescript debuggers, delve treats function calls specially and requires different call paths than usual expression evaluation.
That is because Go is a compiled, runtime-managed GC language, calling a function safely from debugger is complex.
DAP and VS Code UI does not distinguish function calls and other expression evaluation either, so we have to implement this in the same evaluateRequest context.
We use a heuristic to guess which route (call or expression evaluation) we need to take based on evaluateRequest's request.
This is part of the 0.17.0 milestone, yet to be released, and available for now in the nightly build.

USB crash results in "nobody cared" warning

I am working on a USB crash issue on my board, which has a USB 2.0 based HCI. The issue looks like this:
[ 1.691533] irq 36: nobody cared (try booting with the "irqpoll" option)
[ 1.698242] CPU: 0 PID: 87 Comm: kworker/0:1 Not tainted 4.9.88 #24
[ 1.704509] Hardware name: Freescale i.MX8QXP MEK (DT)
[ 1.709659] Workqueue: pm pm_runtime_work
[ 1.713675] Call trace:
[ 1.716123] [<ffff0000080897d0>] dump_backtrace+0x0/0x1b0
[ 1.721523] [<ffff000008089994>] show_stack+0x14/0x20
[ 1.726582] [<ffff0000083daff0>] dump_stack+0x94/0xb4
[ 1.731638] [<ffff00000810f064>] __report_bad_irq+0x34/0xf0
[ 1.737212] [<ffff00000810f4ec>] note_interrupt+0x2e4/0x330
[ 1.742790] [<ffff00000810c594>] handle_irq_event_percpu+0x44/0x58
[ 1.748974] [<ffff00000810c5f0>] handle_irq_event+0x48/0x78
[ 1.754553] [<ffff0000081100a8>] handle_fasteoi_irq+0xc0/0x1b0
[ 1.760390] [<ffff00000810b584>] generic_handle_irq+0x24/0x38
[ 1.766141] [<ffff00000810bbe4>] __handle_domain_irq+0x5c/0xb8
[ 1.771979] [<ffff000008081798>] gic_handle_irq+0x70/0x15c
[ 1.807416] 7a40: 00000000000002ba ffff80002645bf00 00000000fa83b2da 0000000001fe116e
[ 1.815252] 7a60: ffff000088bf7c47 ffffffffffffffff 00000000000003f8 ffff0000085c47b8
[ 1.823088] 7a80: 0000000000000010 ffff800026484600 0000000000000001 ffff8000266e9718
[ 1.830925] 7aa0: ffff00000b8b0008 ffff800026784280 ffff00000b8b000c ffff00000b8d8018
[ 1.838760] 7ac0: 0000000000000001 ffff000008b76000 0000000000000000 ffff800026497b20
[ 1.846596] 7ae0: ffff00000810bd24 ffff800026497b20 ffff000008851d18 0000000000000145
[ 1.854433] 7b00: ffff000008b8d6c0 ffff0000081102d8 ffffffffffffffff ffff00000810dda8
[ 1.862268] [<ffff000008082eec>] el1_irq+0xac/0x120
[ 1.867155] [<ffff000008851d18>] _raw_spin_unlock_irqrestore+0x18/0x48
[ 1.873684] [<ffff00000810bd24>] __irq_put_desc_unlock+0x1c/0x48
[ 1.879695] [<ffff00000810de10>] enable_irq+0x48/0x70
[ 1.884756] [<ffff0000085ba8f8>] cdns3_enter_suspend+0x1f0/0x440
[ 1.890764] [<ffff0000085baca0>] cdns3_runtime_suspend+0x48/0x88
[ 1.896776] [<ffff0000084cf398>] pm_generic_runtime_suspend+0x28/0x40
[ 1.903223] [<ffff0000084dc3e8>] genpd_runtime_suspend+0x88/0x1d8
[ 1.909320] [<ffff0000084d0e08>] __rpm_callback+0x70/0x98
[ 1.914724] [<ffff0000084d0e50>] rpm_callback+0x20/0x88
[ 1.919954] [<ffff0000084d1b2c>] rpm_suspend+0xf4/0x4c8
[ 1.925184] [<ffff0000084d20fc>] rpm_idle+0x124/0x168
[ 1.930240] [<ffff0000084d26c0>] pm_runtime_work+0xa0/0xb8
[ 1.935732] [<ffff0000080dc1dc>] process_one_work+0x1dc/0x380
[ 1.941481] [<ffff0000080dc3c8>] worker_thread+0x48/0x4d0
[ 1.946885] [<ffff0000080e2408>] kthread+0xf8/0x100
[ 1.957080] handlers:
[ 1.959350] [<ffff0000085ba668>] cdns3_irq
[ 1.963449] Disabling IRQ #36
After studying this kind of crash a little, I learned that the kernel disables the IRQ line because the interrupt went unhandled nearly 100000 times.
I have Linux BSP code stuffed with many unwanted components, and it does not show this crash. Once I removed all the unwanted components, the crash started showing up during boot. The tricky part is that it does not happen every time: the IRQ handler returns IRQ_HANDLED in about 7 out of 10 boots, and in those 7 boots no crash occurs.
When I added a print to the IRQ handler, the crash did not occur even after 15 boot attempts, possibly because of the delay the print introduces.
Does anyone have any idea what is actually happening?
First, AFAIK this is not a crash, just a warning. Second, you have already answered your own question :) This is what happens when none of the registered IRQ handlers claims the interrupt. As you said, when the handler returns IRQ_HANDLED (7 out of 10) the message does not appear. That means that 3 times out of 10 the interrupt handler returns something other than IRQ_HANDLED or IRQ_WAKE_THREAD. Check in the sources under which conditions that happens, i.e. where the handler returns anything other than IRQ_HANDLED or IRQ_WAKE_THREAD.
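To make this concrete, here is a minimal user-space model in C of how the "nobody cared" decision is reached. It is only a sketch: every mock_* name is hypothetical, the real bookkeeping has more cases, and the 100000 / 99900 thresholds are my reading of kernel/irq/spurious.c, so treat them as assumptions.

```c
#include <stdbool.h>

/* Mock of the kernel's irqreturn_t values so the sketch builds in user space. */
enum mock_irqreturn {
	MOCK_IRQ_NONE = 0,
	MOCK_IRQ_HANDLED = 1,
	MOCK_IRQ_WAKE_THREAD = 2
};

/* Hypothetical, heavily reduced stand-in for the per-IRQ bookkeeping. */
struct mock_irq_desc {
	unsigned int irq_count;       /* interrupts seen in the current window */
	unsigned int irqs_unhandled;  /* of those, how many nobody claimed */
	bool disabled;                /* set once the line is given up on */
};

/* Simplified accounting: assumed window of 100000, assumed limit of 99900. */
static void mock_note_interrupt(struct mock_irq_desc *desc,
				enum mock_irqreturn ret)
{
	if (desc->disabled)
		return;
	if (ret == MOCK_IRQ_NONE)
		desc->irqs_unhandled++;
	if (++desc->irq_count < 100000)
		return;
	/* End of a window: decide, then start counting afresh. */
	desc->irq_count = 0;
	if (desc->irqs_unhandled > 99900)
		desc->disabled = true;  /* the "Disabling IRQ #36" case */
	desc->irqs_unhandled = 0;
}

/*
 * Feed `total` interrupts to the model. Every `handled_every`-th one is
 * claimed by the handler; 0 means the handler never claims any.
 */
static bool line_gets_disabled(unsigned int total, unsigned int handled_every)
{
	struct mock_irq_desc desc = { 0, 0, false };

	for (unsigned int i = 1; i <= total; i++) {
		bool handled = handled_every != 0 && i % handled_every == 0;

		mock_note_interrupt(&desc, handled ? MOCK_IRQ_HANDLED
						   : MOCK_IRQ_NONE);
	}
	return desc.disabled;
}
```

In this model, line_gets_disabled(100000, 0) is true while line_gets_disabled(100000, 10) is false. That suggests the bad boots are ones where the handler keeps returning the equivalent of IRQ_NONE while the line stays asserted, so the paths in cdns3_irq that return without claiming the interrupt are the place to look.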

blocked for more than 120 seconds

I am trying to write a block device driver that reads blocks from and writes blocks to a network socket. At some point, when reading multiple blocks, the application that uses this driver seems to hang (it still accepts input but does nothing with it), while the system in general stays responsive. dmesg shows the message below. After that I cannot use the driver for anything, even if I start another application that uses it.
I am using Linux kernel v3.9.
Can anyone help fix this?
[ 489.779458] INFO: task xxd:2939 blocked for more than 120 seconds.
[ 489.779466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 489.779469] xxd D 0000000000000000 0 2939 2237 0x00000006
[ 489.779475] ffff8801912a9998 0000000000000046 02fc000000000008 ffff8801bfff7000
[ 489.779479] ffff8801b2ef45f0 ffff8801912a9fd8 ffff8801912a9fd8 ffff8801912a9fd8
[ 489.779482] ffff8801b61e9750 ffff8801b2ef45f0 ffff8801912a9998 ffff8801b8e34af8
[ 489.779485] Call Trace:
[ 489.779497] [<ffffffff81131ad0>] ? __lock_page+0x70/0x70
[ 489.779505] [<ffffffff816e86a9>] schedule+0x29/0x70
[ 489.779510] [<ffffffff816e877f>] io_schedule+0x8f/0xd0
[ 489.779514] [<ffffffff81131ade>] sleep_on_page+0xe/0x20
[ 489.779518] [<ffffffff816e654a>] __wait_on_bit_lock+0x5a/0xc0
[ 489.779522] [<ffffffff811348aa>] ? find_get_pages+0xca/0x150
[ 489.779526] [<ffffffff81131ac7>] __lock_page+0x67/0x70
[ 489.779531] [<ffffffff8107fa50>] ? autoremove_wake_function+0x40/0x40
[ 489.779536] [<ffffffff81140bd2>] truncate_inode_pages_range+0x4b2/0x4c0
[ 489.779540] [<ffffffff81140c65>] truncate_inode_pages+0x15/0x20
[ 489.779545] [<ffffffff811d331c>] kill_bdev+0x2c/0x40
[ 489.779548] [<ffffffff811d3931>] __blkdev_put+0x71/0x1c0
[ 489.779552] [<ffffffff811aeb48>] ? __d_free+0x48/0x70
[ 489.779556] [<ffffffff811d3adb>] blkdev_put+0x5b/0x160
[ 489.779559] [<ffffffff811d3c05>] blkdev_close+0x25/0x30
[ 489.779564] [<ffffffff8119b16a>] __fput+0xba/0x240
[ 489.779568] [<ffffffff8119b2fe>] ____fput+0xe/0x10
[ 489.779572] [<ffffffff8107ba18>] task_work_run+0xc8/0xf0
[ 489.779577] [<ffffffff8105f797>] do_exit+0x2c7/0xa70
[ 489.779581] [<ffffffff8106f32e>] ? send_sig_info+0x1e/0x20
[ 489.779585] [<ffffffff8106f34c>] ? send_sig+0x1c/0x20
[ 489.779588] [<ffffffff8105ffd4>] do_group_exit+0x44/0xa0
[ 489.779592] [<ffffffff8106fe00>] get_signal_to_deliver+0x230/0x600
[ 489.779600] [<ffffffff81014398>] do_signal+0x58/0x8e0
[ 489.779605] [<ffffffff81014ca0>] do_notify_resume+0x80/0xc0
[ 489.779608] [<ffffffff816f241a>] int_signal+0x12/0x17
I had the synchronization around the socket wrong. The resulting race conditions left some requests unserved, and those unserved requests caused the process to hang.
Adding some mutexes (not semaphores) fixed this.
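For readers who want to see the shape of the fix, here is a user-space analogue in C using pthreads. It is not the driver code: the kernel would use struct mutex rather than pthread_mutex_t, and every name here (fetch_block, run_demo, the "+ 1000" reply protocol) is made up for illustration. The point is that one request/reply exchange on a shared socket is done entirely under one lock, so replies cannot be picked up by the wrong requester.

```c
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#define N_CLIENTS  4
#define N_REQUESTS 100

static pthread_mutex_t sock_lock = PTHREAD_MUTEX_INITIALIZER;
static int client_fd, server_fd;

/* Hypothetical remote end: answers each block number with block_no + 1000. */
static void *server_main(void *arg)
{
	uint32_t block_no;

	(void)arg;
	while (read(server_fd, &block_no, sizeof block_no) == (ssize_t)sizeof block_no) {
		uint32_t reply = block_no + 1000;

		if (write(server_fd, &reply, sizeof reply) != (ssize_t)sizeof reply)
			break;
	}
	return NULL;
}

/* One whole transaction (send request, read reply) under a single lock. */
static int fetch_block(uint32_t block_no, uint32_t *reply)
{
	int ok;

	pthread_mutex_lock(&sock_lock);
	ok = write(client_fd, &block_no, sizeof block_no) == (ssize_t)sizeof block_no &&
	     read(client_fd, reply, sizeof *reply) == (ssize_t)sizeof *reply;
	pthread_mutex_unlock(&sock_lock);
	return ok ? 0 : -1;
}

/* Each client counts transactions that failed or returned someone else's reply. */
static void *client_main(void *arg)
{
	uint32_t id = (uint32_t)(uintptr_t)arg;
	uintptr_t mismatches = 0;

	for (uint32_t i = 0; i < N_REQUESTS; i++) {
		uint32_t block_no = id * N_REQUESTS + i, reply = 0;

		if (fetch_block(block_no, &reply) != 0 || reply != block_no + 1000)
			mismatches++;
	}
	return (void *)mismatches;
}

/* Returns the total number of bad transactions, or -1 if setup failed. */
static long run_demo(void)
{
	int fds[2];
	pthread_t server, clients[N_CLIENTS];
	long total = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
		return -1;
	client_fd = fds[0];
	server_fd = fds[1];
	if (pthread_create(&server, NULL, server_main, NULL) != 0)
		return -1;
	for (uintptr_t i = 0; i < N_CLIENTS; i++)
		if (pthread_create(&clients[i], NULL, client_main, (void *)i) != 0)
			return -1;
	for (int i = 0; i < N_CLIENTS; i++) {
		void *res;

		pthread_join(clients[i], &res);
		total += (long)(uintptr_t)res;
	}
	close(client_fd);            /* server sees EOF and leaves its loop */
	pthread_join(server, NULL);
	close(server_fd);
	return total;
}
```

With the lock in place run_demo() should report zero bad transactions. If the lock and unlock calls are removed from fetch_block, two threads can interleave their writes and reads, and one of them may read a reply meant for the other, which is the kind of race described above.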

Calling Win32 functions returning strings with alien in Lua

I'm trying to use alien to call Win32 functions. I tried this code, but it crashes and I don't understand why.
require( "alien" )
local f = alien.Kernel32.ExpandEnvironmentStringsA
f:types( "int", "string", "pointer", "int" )
local buffer = alien.buffer( 512 )
f( "%USERPROFILE%", 0, 512 )
This is a good question, and for me an opportunity to try out Alien...
If you don't mind, I will take the opportunity to explain how to set up Alien, so that people like me (not very used to require) who stumble upon this thread can get started...
You gave the link to the LuaForge page; I went there and saw I needed LuaRocks to get it. :-(
I should install the latter someday, but I chose to skip that for now. So I went to the repository and downloaded alien-0.4.1-1.win32-x86.rock.
I found out it was a plain Zip file, which I could unzip as usual.
After fumbling a bit with require, I ended up hacking the paths in the Lua script for a quick test. I should set LUA_PATH and LUA_CPATH in my environment instead; I will do that later.
So I took alien.lua, core.dll and struct.dll from the unzipped folders and put them under a directory named Alien in a common library repository.
And I added the following lines to the start of my script (bad hack warning!):
package.path = 'C:/PrgCmdLine/Tecgraf/lib/?.lua;' .. package.path
package.cpath = 'C:/PrgCmdLine/Tecgraf/lib/?.dll;' .. package.cpath
require[[Alien/alien]]
Then I tried it with a simple, no-frills function with immediate visual result: MessageBox.
local mb = alien.User32.MessageBoxA
mb:types{ 'long', 'long', 'string', 'string', 'long' }
print(mb(0, "Hello World!", "Cliché", 64))
Yes, I got the message box! But upon clicking OK, I got a crash of Lua, probably like you.
After a quick scan of the Alien docs, I found out the (unnamed) culprit: we need to use the stdcall calling convention for the Windows API:
mb:types{ ret = 'long', abi = 'stdcall', 'long', 'string', 'string', 'long' }
With that, it was trivial to make your call work:
local eev = alien.Kernel32.ExpandEnvironmentStringsA
eev:types{ ret = "long", abi = 'stdcall', "string", "pointer", "long" }
local buffer = alien.buffer(512)
eev("%USERPROFILE%", buffer, 512)
print(tostring(buffer))
Note that I passed the buffer as the second argument of the eev call; your original call passed 0 there instead...
