rcu_preempt self-detected stall on CPU { 0} - linux-kernel

I am running an application on my Intel Rangeley board, which runs a 3.14.29-rt22 kernel.
The application runs two threads, each at priority 39, periodically for 1 ms and 2 ms respectively.
Both threads run in a continuous while loop, and both run
only on core 0.
After it has been running for some time, around 10 minutes, I press Ctrl+C and get the log below.
INFO: rcu_preempt self-detected stall on CPU { 0} (t=21000 jiffies g=2362 c=2361 q=207)
sending NMI to all CPUs:
NMI backtrace for cpu 1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.29ltsi-rt22-yocto-preempt-rt+ #1
Hardware name: ADI Engineering RCC-VE/RCC-VE, BIOS ADI_RCCVE-01.00.00.04-nodebug 05/06/2015
task: ffff8802761a0000 ti: ffff8802761a8000 task.ti: ffff8802761a8000
RIP: 0010:[<ffffffff8100b451>] [<ffffffff8100b451>] native_read_tsc+0x1/0x20
RSP: 0018:ffff8802761abe28 EFLAGS: 00000003
RAX: 0000000000000000 RBX: ffffffff81e1acc0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffffff81e1acc0
RBP: ffff8802761abe38 R08: ffff8802761a8000 R09: 0000000000000001
R10: 0000000000000800 R11: 0000000000000000 R12: 000000000000003e
R13: 0000000000014e76 R14: ffff8802761abfd8 R15: ffff88027fc8cf00
FS: 0000000000000000(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fabcd23f000 CR3: 0000000269589000 CR4: 00000000001007e0
Stack:
ffff8802761abe38 ffffffff8100b4a9 ffff8802761abe60 ffffffff810a6b73
0000000000000001 ffff8802761abfd8 ffffffff81edc030 ffff8802761abec0
ffffffff810b01a5 ffffffffffffff10 ffffffff8103b906 0000000000000000
Call Trace:
[<ffffffff8100b4a9>] ? read_tsc+0x9/0x20
[<ffffffff810a6b73>] ktime_get+0x43/0xc0
[<ffffffff810b01a5>] __tick_nohz_idle_enter+0x25/0x480
[<ffffffff8103b906>] ? native_safe_halt+0x6/0x10
[<ffffffff810b064a>] tick_nohz_idle_enter+0x4a/0x80
[<ffffffff8109a626>] cpu_startup_entry+0x46/0x290
[<ffffffff81031597>] start_secondary+0x1b7/0x210
What could be the reason? Is it because I am using the CPU continuously for a long time?
When I print anything to the console from the thread, the problem does not occur.

Yes, continuous use of the CPU by a high-priority thread for a long time (1 ms is a long period from the scheduler's point of view) could be the cause of the RCU stall.
From the documentation on the RCU stall detector:
The following problems can result in RCU CPU stall
warnings:
... A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
happen to preempt a low-priority task in the middle of an RCU
read-side critical section. This is especially damaging if
that low-priority task is not permitted to run on any other CPU,
in which case the next RCU grace period can never complete, which
will eventually cause the system to run out of memory and hang.
... A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
is running at a higher priority than the RCU softirq threads.
This will prevent RCU callbacks from ever being invoked,
and in a CONFIG_PREEMPT_RCU kernel will further prevent
RCU grace periods from ever completing. Either way, the
system will eventually run out of memory and hang.
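Performing any system call (such as write() to the console) from the high-priority thread gives the kernel a chance to do some of its housekeeping work.
Possibly sched_yield() will help too, although if the thread uses a real-time policy such as SCHED_FIFO it only yields to threads of the same priority.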
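
One way to keep a periodic real-time thread from monopolizing the CPU is to block until the next deadline instead of spinning in the loop. The following is only a user-space sketch under assumptions, not the asker's code: run_periodic, timespec_add_ns and count_work are hypothetical names, and the loop sleeps with clock_nanosleep() using TIMER_ABSTIME on CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Stand-in for the thread's real work (hypothetical). */
static volatile int work_calls;
static void count_work(void)
{
    work_calls++;
}

/* Advance an absolute timestamp by ns nanoseconds, keeping tv_nsec below 1 s. */
static void timespec_add_ns(struct timespec *t, long ns)
{
    assert(ns >= 0);
    t->tv_nsec += ns;
    while (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec += 1;
    }
}

/*
 * Run work() once per period, then block until the next absolute
 * deadline so that lower-priority tasks can use the CPU in between.
 * Returns the number of completed iterations, or -1 on a clock error.
 */
static int run_periodic(long period_ns, int iterations, void (*work)(void))
{
    struct timespec next;
    int i;

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (i = 0; i < iterations; i++) {
        work();
        timespec_add_ns(&next, period_ns);
        /* An absolute deadline avoids drift; nonzero means error or EINTR. */
        if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) != 0)
            return -1;
    }
    return i;
}
```

A 1 ms thread body would call something like run_periodic(1000000L, n, do_work). Unlike sched_yield(), the sleep takes the thread off the run queue, so tasks at any priority can run until the deadline.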

I was getting something strikingly similar to this during boot: it would hang, pressing any key (even Num Lock) would unhang it, and it would hang again after a few seconds. I had to do this 5-7 times per boot!
The culprit was a BIOS setting: AMD C1E Support was set to Enabled. Setting it to Auto or Disabled (tested both) fixed the issue for me. No more stalls or hangs!

Related

What is the exact order of the launch/activation phases of a Store-installed UWP application? (HANG_ACTIVATION reports research)

Recently I've faced a weird issue: an increased count of HANG_ACTIVATION reports when the app is unable to reach Microsoft servers on startup. According to MS dashboard data, a lot of our users were unable to launch the application on their devices. I found that the issue was caused by failed requests to Microsoft servers (POST https://licensing.mp.microsoft.com/v7.0/licenses/leases/renew is one of them, along with Azure purchase requests). However, when you start the app from Visual Studio or disable the internet connection, the app starts just fine, so the issue is definitely related to some MS Store Windows services.
After some further research into the reports, I assume that the issue occurs when the application has not reached the onActivate event for a long time. It turns out that some Windows service tracks the time from application start to the Activation event and shuts the app down on timeout. Unfortunately, it was hard to find any documentation on this (please point me to it if you know where I can find any). I managed to replicate the same behavior by placing a sleep of ~2 min between the onLaunch and onActivate events. Besides, according to network request diagnostics, there is some retry count for MS requests (when I tested another app, it launched successfully even with failed requests).
I have plenty of initialization code that runs after launch and before onActivated is fired; this initialization may take up to 1 minute at application launch.
That's why I think relocating it might help overcome the issue, but I'm not sure exactly how activation works under the hood.
Typical dashboard report:
Frame Image Function Offset
0 3D4Medical.comLLC.CompleteAnatomy HANG_ACTIVATION 0x0000000000000000
1 unknown.dll [.ecxr] 0x0000000000000000
2 ntdll.dll NtWaitForMultipleObjects 0x0000000000000000
3 KERNELBASE.dll WaitForMultipleObjectsEx 0x0000000000000000
4 combase.dll MTAThreadWaitForCall 0x0000000000000000
5 combase.dll MTAThreadDispatchCrossApartmentCall 0x0000000000000000
6 combase.dll CSyncClientCall::SendReceive2 0x0000000000000000
7 combase.dll CSyncClientCall::SendReceive 0x0000000000000000
8 combase.dll NdrExtpProxySendReceive 0x0000000000000000
9 rpcrt4.dll NdrClientCall2 0x0000000000000000
10 combase.dll ObjectStublessClient 0x0000000000000000
11 combase.dll ObjectStubless 0x0000000000000000
12 twinapi.appcore.dll Windows::ApplicationModel::Core::CoreApplicationViewAgileContainer::ActivateInternal 0x0000000000000000
13 twinapi.appcore.dll Windows::ApplicationModel::Core::CoreApplicationViewAgileContainer::Activate 0x0000000000000000
14 twinapi.appcore.dll Windows::ApplicationModel::Core::CoreApplication::ActivateForeground 0x0000000000000000
15 twinapi.appcore.dll Windows::ApplicationModel::Core::CoreApplication::ActivateApplication 0x0000000000000000
16 twinapi.appcore.dll Windows::ApplicationModel::Core::ApplicationActivationFactory::Activate 0x0000000000000000
17 rpcrt4.dll Invoke 0x0000000000000000
18 rpcrt4.dll NdrStubCall2 0x0000000000000000
19 combase.dll CStdStubBuffer_Invoke 0x0000000000000000
20 rpcrt4.dll CStdStubBuffer_Invoke 0x0000000000000000
21 combase.dll ObjectMethodExceptionHandlingAction__lambda_ee1df801181086a03fa4f8f75bd5617f_ _ 0x0000000000000000
22 combase.dll DefaultStubInvoke 0x0000000000000000
23 combase.dll ServerCall::ContextInvoke 0x0000000000000000
24 combase.dll AppInvoke 0x0000000000000000
25 combase.dll ComInvokeWithLockAndIPID 0x0000000000000000
26 combase.dll ThreadInvoke 0x0000000000000000
27 rpcrt4.dll DispatchToStubInCNoAvrf 0x0000000000000000
28 rpcrt4.dll RPC_INTERFACE::DispatchToStubWorker 0x0000000000000000
29 rpcrt4.dll RPC_INTERFACE::DispatchToStubWithObject 0x0000000000000000
30 rpcrt4.dll LRPC_SCALL::DispatchRequest 0x0000000000000000
31 rpcrt4.dll LRPC_SCALL::HandleRequest 0x0000000000000000
32 rpcrt4.dll LRPC_ADDRESS::HandleRequest 0x0000000000000000
33 rpcrt4.dll LRPC_ADDRESS::ProcessIO 0x0000000000000000
34 rpcrt4.dll LrpcIoComplete 0x0000000000000000
35 ntdll.dll TppAlpcpExecuteCallback 0x0000000000000000
36 ntdll.dll TppWorkerThread 0x0000000000000000
37 kernel32.dll BaseThreadInitThunk 0x0000000000000000
38 ntdll.dll __RtlUserThreadStart 0x0000000000000000
39 ntdll.dll _RtlUserThreadStart 0x0000000000000000
All in all, there are 2 questions I would like answered:
Is my understanding of the launch process of a Store-installed app right?
Is relocating the initialization code, from the section after onLaunch to the section after onActivate, a good approach to avoid future problems?

Detecting last mode of operation in an NMI handler

I am writing an NMI handler in an LKM. I would like to know the mode of operation (user or kernel) at the moment the NMI fires. Is there any kernel flag to denote that? I am running Linux 4.18.0.
You can determine whether the CPU was in user or kernel mode from the value of the CS register, which the CPU saves on the stack along with RIP, RSP, SS, etc.
The stack layout for interrupts is described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1, Section 6.12.1.
In kernel mode, the saved CS value is __KERNEL_CS; in user mode, it is __USER_CS.
The kernel's default NMI handler actually does this in /arch/x86/entry/entry_64.S:
ENTRY(nmi)
...
testb $3, CS-RIP+8(%rsp)
jz .Lnmi_from_kernel
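
The same test can be written in C. The sketch below is a user-space illustration of the privilege-bit check only: saved_cs_is_user and saved_cs_examples are hypothetical names, and the selector values 0x10 and 0x33 are assumed to be the usual x86-64 Linux kernel and user code selectors (0x10 matches the "RIP: 0010:" shown in kernel oops output). Inside a handler registered with register_nmi_handler(), the saved registers arrive as a struct pt_regs *, and the kernel's own user_mode(regs) helper should perform a comparable check on regs->cs; verify that against your kernel's headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a segment selector hold the privilege level (0 = kernel). */
#define SELECTOR_RPL_MASK 0x3u

/* Mirrors "testb $3, <saved CS>": nonzero privilege bits mean user mode. */
static bool saved_cs_is_user(uint16_t saved_cs)
{
    return (saved_cs & SELECTOR_RPL_MASK) != 0;
}

/* Example selector values, assumed typical for x86-64 Linux. */
static void saved_cs_examples(void)
{
    assert(!saved_cs_is_user(0x10)); /* kernel code segment */
    assert(saved_cs_is_user(0x33));  /* 64-bit user code segment */
}
```

Testing the low bits rather than comparing against a single constant also covers 32-bit compatibility-mode processes, whose code selector differs from that of 64-bit processes.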

Kernel panic using deferred_io on kmalloced buffer

I'm writing a framebuffer for an SPI LCD display on ARM. Before I complete that, I've written a memory only driver and trialled it under Ubuntu (Intel, Virtualbox). The driver works fine - I've allocated a block of memory using kmalloc, page aligned it (it's page aligned anyway actually), and used the framebuffer system to create a /dev/fb1. I have my own mmap function if that's relevant (deferred_io ignores it and uses its own by the look of it).
I have set:
info->screen_base = (u8 __iomem *)kmemptr;
info->fix.smem_len = kmem_size;
When I open /dev/fb1 with a test program and mmap it, it works correctly. I can see what is happening by using x11vnc to "share" fb1 out:
x11vnc -rawfb map:/dev/fb1#320x240x16
And view with a vnc viewer:
gvncviewer strontium:0
I've made sure I've no overflows by writing to the entire mmapped buffer and that seems to be fine.
The problem arises when I add in deferred_io. As a test of it, I have a delay of 1 second and the called deferred_io function does nothing except a pr_devel() print. I followed the docs.
Now, the test program opens /dev/fb1 fine, mmap returns ok but as soon as I write to that pointer, I get a kernel panic. The following dump is from the ARM machine actually but it panics on the Ubuntu VM as well:
root#duovero:~/testdrv# ./fbtest1 /dev/fb1
Device opened: /dev/fb3
Screen is: 320 x 240, 16 bpp
Screen size = 153600 bytes
mmap on device succeeded
Unable to handle kernel paging request at virtual address bf81e020
pgd = edbec000
[bf81e020] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in: hhlcd28a(O) sysimgblt sysfillrect syscopyarea fb_sys_fops bnep ipv6 mwifiex_sdio mwifiex btmrvl_sdio firmware_class btmrvl cfg80211 bluetooth rfkill
CPU: 0 Tainted: G O (3.6.0-hh04 #1)
PC is at fb_deferred_io_fault+0x34/0xb0
LR is at fb_deferred_io_fault+0x2c/0xb0
pc : [<c0271b7c>] lr : [<c0271b74>] psr: a0000113
sp : edbdfdb8 ip : 00000000 fp : edbeedb8
r10: edbeedb8 r9 : 00000029 r8 : edbeedb8
r7 : 00000029 r6 : bf81e020 r5 : eda99128 r4 : edbdfdd8
r3 : c081e000 r2 : f0000000 r1 : 00001000 r0 : bf81e020
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5387d Table: adbec04a DAC: 00000015
Process fbtest1 (pid: 485, stack limit = 0xedbde2f8)
Stack: (0xedbdfdb8 to 0xedbe0000)
[snipped out hexdump]
[<c0271b7c>] (fb_deferred_io_fault+0x34/0xb0) from [<c00db0c4>] (__do_fault+0xbc/0x470)
[<c00db0c4>] (__do_fault+0xbc/0x470) from [<c00dde0c>] (handle_pte_fault+0x2c4/0x790)
[<c00dde0c>] (handle_pte_fault+0x2c4/0x790) from [<c00de398>] (handle_mm_fault+0xc0/0xd4)
[<c00de398>] (handle_mm_fault+0xc0/0xd4) from [<c049a038>] (do_page_fault+0x140/0x37c)
[<c049a038>] (do_page_fault+0x140/0x37c) from [<c0008348>] (do_DataAbort+0x34/0x98)
[<c0008348>] (do_DataAbort+0x34/0x98) from [<c0498af4>] (__dabt_usr+0x34/0x40)
Exception stack(0xedbdffb0 to 0xedbdfff8)
ffa0: 00000280 0000ffff b6f5c900 00000000
ffc0: 00000003 00000000 00025800 b6f5c900 bea6dc1c 00011048 00000032 b6f5b000
ffe0: 00006450 bea6db70 00000000 000085d6 40000030 ffffffff
Code: 28bd8070 ebffff37 e2506000 0a00001b (e5963000)
---[ end trace 7e5ca57bebd433f5 ]---
Segmentation fault
root#duovero:~/testdrv#
I'm totally stumped - other drivers look more or less the same as mine but I assume they work. Most use vmalloc actually - is there a difference between kmalloc and vmalloc for this purpose?
Confirmed the fix so I'll answer my own question:
deferred_io replaces the info mmap with its own, which sets up fault handlers for writes to the video memory pages. In the fault handler it:
1. checks bounds against info->fix.smem_len, so you must set that
2. gets the page that was written to.
For the latter step, it treats vmalloc differently from kmalloc (by checking info->screen_base to see whether it was vmalloced). If you used vmalloc, it uses screen_base as the virtual address. If you did not use vmalloc, it assumes that the address of interest is the physical address in info->fix.smem_start.
So, to use deferred_io correctly:
1. set info->screen_base (char __iomem *) to point to the buffer's virtual address
2. set info->fix.smem_len to the video buffer size
3. if you are not using vmalloc, set info->fix.smem_start to the video buffer's physical address using virt_to_phys(vid_buffer);
Confirmed on Ubuntu as fixing the issue.
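
To see why an unset smem_start matters for a kmalloc'ed buffer, here is a small user-space model of the arithmetic described above. It is only a sketch: model_offset_in_bounds, model_fault_pfn and MODEL_PAGE_SHIFT are hypothetical names, 4 KiB pages are assumed, the physical address 0x8d000000 is made up, and the real handler works with struct page rather than plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Step 1 of the fault handler: reject offsets beyond the declared buffer. */
static bool model_offset_in_bounds(uint64_t offset, uint64_t smem_len)
{
    return offset < smem_len;
}

/*
 * Step 2 for a non-vmalloc buffer: the page frame number is derived
 * from the physical start address plus the faulting offset.
 */
static uint64_t model_fault_pfn(uint64_t smem_start, uint64_t offset)
{
    return (smem_start + offset) >> MODEL_PAGE_SHIFT;
}

/* Compare a forgotten smem_start with a made-up physical address. */
static void model_examples(void)
{
    assert(model_offset_in_bounds(0x2000, 153600));
    assert(model_fault_pfn(0, 0x2000) == 0x2);                /* unrelated low frame */
    assert(model_fault_pfn(0x8d000000u, 0x2000) == 0x8d002u); /* frame inside the buffer */
}
```

With smem_start left at zero, the computed frame has nothing to do with the video buffer, which is consistent with the fault handler crashing on the first write.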
Really interesting. I'm currently implementing an SPI-based display FB driver too (for a Sharp Memory LCD display and my VFDHack32 host driver). I'm also facing a similar problem where it crashes in deferred_io. Can you share your source code? Mine is in my GitHub repo. P.S. The Memory LCD display is monochrome, so I just pretend it is a color display and check whether the pixel byte is empty (dot off) or not empty (dot on).

DEP (Data Execution Prevention) violation on executable address?

My app crashed at startup in an MSHTML worker thread. The EXCEPTION_RECORD gives:
0:066> .exr 0e11f668
ExceptionAddress: 732019ab (rtutils!AcquireWriteLock+0x00000010)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000008
Parameter[1]: 732019ab
Attempt to execute non-executable address 732019ab
But !address shows that the address 732019ab is indeed executable:
0:066> !address 732019ab
Usage: Image
Base Address: 73201000
End Address: 7320a000
Region Size: 00009000
State: 00001000 MEM_COMMIT
Protect: 00000020 PAGE_EXECUTE_READ
Type: 01000000 MEM_IMAGE
Allocation Base: 73200000
Allocation Protect: 00000080 PAGE_EXECUTE_WRITECOPY
Image Path: C:\Windows\SysWOW64\rtutils.dll
Module Name: rtutils
Loaded Image Name: rtutils.dll
Mapped Image Name:
More info: lmv m rtutils
More info: !lmi rtutils
More info: ln 0x732019ab
More info: !dh 0x73200000
The instruction at 732019ab is:
0:066> u 732019ab l1
rtutils!AcquireWriteLock+0x10:
732019ab 8d4618 lea eax,[esi+18h]
Why is a DEP violation being reported at an address whose page is marked as PAGE_EXECUTE_WRITECOPY?
Yep, that seems pretty impossible. I don't have an answer, but the list of possibilities is too long for a comment.
If I were to guess, I'd say something is playing with the protection flags on that page, but putting it back to PAGE_EXECUTE_READ after (or while) the exception is being raised. Start by seeing if your code (or any libraries you use) plays with VirtualProtect.
If that doesn't reveal anything, we can move on to some other possibilities:
Malware
Some malware likes to play with hooking/hotpatching and has been known to cause similar problems.
Faulty Antivirus
Antivirus applications employ a lot of the same tricks as malware. If issues stop after disabling it, you've found your culprit and can look at updating/replacing it.
A Bad Kernel Driver
In kernel mode, you can achieve the impossible accidentally, but never on purpose. :)
A Faulty CPU
An overclocked or poorly cooled CPU can cause many unpredictable things to happen. Not likely, but possible.

How can I relocate main() to 0x00000000?

Here's the nm dump of my program.
00000000 T __ctors_end
00000000 T __ctors_start
00000000 T __dtors_end
00000000 T __dtors_start
00000000 a __tmp_reg__
00000000 T __trampolines_end
00000000 T __trampolines_start
00000000 T setup
00000001 a __zero_reg__
0000003d a __SP_L__
0000003e a __SP_H__
0000003f a __SREG__
00000072 T __vector_15
00000086 T main
000000a8 A __data_load_end
000000a8 A __data_load_start
000000a8 T _etext
00800100 D _edata
00800100 T _end
00810000 T __eeprom_end
The architecture is AVR, and I need to get main() back up to 0x00000000 in order for the chip that I'm running this code on to execute properly. It should be as simple as a linker script, shouldn't it?
It doesn't matter where main() is in memory. Simply put a jump instruction to its address at the reset vector, or 0x0000 in application memory.
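
To make the reset-vector suggestion concrete, the sketch below computes the 16-bit rjmp opcode that would sit at address 0x0000 and jump to main at 0x86 from the nm dump. avr_rjmp_encode is a hypothetical helper written for illustration only; the toolchain's startup code and linker script normally emit this jump for you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode an AVR "rjmp" located at from_byte that jumps to to_byte.
 * Encoding: 1100 kkkk kkkk kkkk, where k is a signed 12-bit word offset
 * and the CPU computes PC <- PC + k + 1 (PC counts 16-bit words).
 * Returns 1 on success, 0 if the target is out of rjmp range.
 */
static int avr_rjmp_encode(uint32_t from_byte, uint32_t to_byte, uint16_t *opcode)
{
    int32_t k;

    assert(opcode != NULL);
    assert(from_byte % 2 == 0 && to_byte % 2 == 0); /* instructions are word aligned */

    k = (int32_t)(to_byte / 2) - (int32_t)(from_byte / 2) - 1;
    if (k < -2048 || k > 2047)
        return 0;

    *opcode = (uint16_t)(0xC000u | ((uint32_t)k & 0x0FFFu));
    return 1;
}
```

For a jump from 0x0000 to 0x0086 the word offset is 0x42, so the helper produces 0xC042. Targets farther than 2047 words away would need the two-word jmp instruction instead, on parts that have it.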
I used to program AVRs, and as far as I know the only way to change the main() entry point is through the fuse bits. But that only lets you put it at the end of flash, for a bootloader. Depending on the chip, main starts at different places; I'm not sure, but on AVR it should be somewhere around 0x20 to 0x100.
That is because the RESET vector, registers and interrupt vectors come first.
This structure helps very much: I once had a project in which I couldn't use the watchdog, so the only way to trigger a reset was an overflow.
Also, I've read your comment. You don't need to put 256 bytes of 0x00 there: that area is for some registers (AVR registers are divided into two places, one in SRAM, the other in flash) and the interrupt vectors, so if you use, let's say, a timer or UART and your code starts at 0x00, initializing these would destroy your code.
It is designed to work this way, and I think redesigning it would spoil that. But if you really want this, you can try adding the -Ttext=0x0000 flag. This may build it the way you want, but I do not recommend doing that.
