How to debug IACCVIOL where all registers have been cleared? - debugging

I'm developing a complex application for the STM32F746. I ran into the following hard fault and I'm not sure how to find its origin:
16:05:51.832 HardFault : ExceptionFrame { r0: 0xffffffff, r1: 0xffffffff, r2: 0xffffffff, r3: 0xffffffff, r12: 0xffffffff, lr: 0xffffffff, pc: 0xffffffff, xpsr: 0xffffffff }
16:05:51.832 UFSR : 0000000000000000
16:05:51.832 BFSR : 00000000
16:05:51.832 MMFSR : 00000001
16:05:51.832 HFSR : 01000000000000000000000000000000
The MMFSR part of the CFSR clearly indicates the error is IACCVIOL. Unfortunately, MMARVALID is not set, so I can't use the MMFAR to find the root of the issue.
Stepping through with GDB takes a huge amount of time for very little progress, as I need to start over every time the fault appears. I couldn't find a way to record/replay the session to quickly track down the issue in GDB.
Is there an approach that could help me pinpoint where the code fails?
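The status registers in the dump above can be decoded bit by bit. The following is a minimal Python sketch, assuming the log prints each register as a binary string (as the HardFault dump in the question does). The bit positions are those of the ARMv7-M MMFSR and HFSR; `decode` and the two tables are names made up for this example.

```python
# Bit names for the ARMv7-M MemManage Fault Status Register (low byte of CFSR).
MMFSR_BITS = {0: "IACCVIOL", 1: "DACCVIOL", 3: "MUNSTKERR",
              4: "MSTKERR", 5: "MLSPERR", 7: "MMARVALID"}
# Bit names for the HardFault Status Register.
HFSR_BITS = {1: "VECTTBL", 30: "FORCED", 31: "DEBUGEVT"}


def decode(bit_string, names):
    """Return the names of all bits set in a register printed in binary."""
    value = int(bit_string, 2)
    return [names.get(bit, f"bit{bit}")
            for bit in range(value.bit_length())
            if (value >> bit) & 1]


# Values copied from the log in the question.
print("MMFSR:", decode("00000001", MMFSR_BITS))
print("HFSR :", decode("01000000000000000000000000000000", HFSR_BITS))
```

For the two non-zero registers this reports IACCVIOL and FORCED, which is consistent with a MemManage fault that was escalated to a HardFault.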

Related

Debugging u-boot crash

I am facing a data abort in U-Boot and am not able to find the root cause of the issue. Can someone tell me how to trace the logs here, or how to debug and decode them?
Which file in U-Boot gives the required details?
Below are the crash logs:
data abort
pc : [<fff3fcb8>] lr : [<1a000018>]
reloc pc : [<B30017cb8>] lr : [<4a0d8018>]
sp : fdf17e5c ip : fff88a6c fp : 00000017
r10: 30061f88 r9 : fdf17ef8 r8 : fdf18a78
r7 : 00000010 r6 : 00000028 r5 : fdf3d138 r4 : 17f18ab8
r3 : fdf18a88 r2 : 00000018 r1 : fdf18aa0 r0 : 00000000
Flags: nzCv IRQs off FIQs off Mode SVC_32
Resetting CPU ...
BR,Abhi
First, disassemble the U-Boot binary. Which objdump you need depends on your host and target architecture, e.g.
arm-linux-gnueabihf-objdump -sD u-boot > u-boot.txt
Then look up the reloc pc address in the resulting u-boot.txt.
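Searching the disassembly by hand works, but the lookup can also be scripted. The following is a rough Python sketch, assuming the text has the usual `objdump -D` shape (`<address> <symbol>:` headers followed by `<address>:` instruction lines). `find_instruction` is a hypothetical helper, and the sample listing is invented for illustration, not taken from a real U-Boot build.

```python
import re


def find_instruction(disassembly, address):
    """Return (function_name, instruction_line) for `address`, or None."""
    function = None
    for line in disassembly.splitlines():
        # Symbol header, e.g. "00001000 <example_init>:"
        header = re.match(r"^[0-9a-f]+ <(.+)>:$", line)
        if header:
            function = header.group(1)
            continue
        # Instruction line, e.g. "    1004:<TAB>e5903000 <TAB>ldr ..."
        insn = re.match(r"^\s*([0-9a-f]+):\s", line)
        if insn and int(insn.group(1), 16) == address:
            return function, line.strip()
    return None


# Hypothetical excerpt shaped like objdump output (made-up symbols/addresses).
SAMPLE = (
    "00001000 <example_init>:\n"
    "    1000:\te92d4010 \tpush\t{r4, lr}\n"
    "    1004:\te5903000 \tldr\tr3, [r0]\n"
    "00001008 <example_exit>:\n"
    "    1008:\te12fff1e \tbx\tlr\n"
)

print(find_instruction(SAMPLE, 0x1004))
```

In practice the text would be read from u-boot.txt and the address taken from the `reloc pc` field of the crash log.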

How can I see the full backtrace using kgdb to debug an ARM Linux module?

I worked my way through all of the free Linux training materials created by Free Electrons. In the last lab, we learn to use kgdb to remotely debug a simple crash in a loadable module. The crash is caused by a null pointer dereference in a memzero function call.
I am using Linux kernel 4.9 and a BeagleBone Black as the target, all according to the recommendations for the labs, and I've had no problems up to this point. My host is Ubuntu xenial and I am using standard packages for the ARM toolchain (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4) and gdb (7.11.1-0ubuntu1~16.04) debugger.
gdb is able to read the symbol tables from vmlinux and from the module with the bug in it, which is called drvbroken.ko. The module has a bug in its init function, so it crashes immediately when I insmod it.
gdb output:
(gdb) backtrace
#0 __memzero () at arch/arm/lib/memzero.S:69
#1 0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) list 69
64 ldmeqfd sp!, {pc} @ 1/2 quick exit
65 /*
66 * No need to correct the count; we're only testing bits from now on
67 */
68 tst r1, #32 @ 1
69 stmneia r0!, {r2, r3, ip, lr} @ 4
70 stmneia r0!, {r2, r3, ip, lr} @ 4
71 tst r1, #16 @ 1 16 bytes or more?
72 stmneia r0!, {r2, r3, ip, lr} @ 4
73 ldr lr, [sp], #4 @ 1
The result is the same whether I build the kernel with CONFIG_ARM_UNWIND (the default) or disable that and use CONFIG_FRAME_POINTER (the old method recommended by the lab notes).
I tried the same procedure in kdb, and here I see a very long backtrace that includes the calling functions. The caller of memzero is cdev_init.
kdb output:
Entering kdb (current=0xde616240, pid 106) on processor 0 Oops: (null)
due to oops # 0xc04c2be0
CPU: 0 PID: 106 Comm: insmod Tainted: G O 4.9.0-dirty #1
Hardware name: Generic AM33XX (Flattened Device Tree)
task: de616240 task.stack: de676000
PC is at __memzero+0x40/0x7c
LR is at 0x0
pc : [<c04c2be0>] lr : [<00000000>] psr: 00000013
sp : de677da4 ip : 00000000 fp : de677dbc
r10: bf000240 r9 : 219a3868 r8 : 00000000
r7 : de65c7c0 r6 : de6420c0 r5 : bf0000b4 r4 : 00000000
r3 : 00000000 r2 : 00000000 r1 : fffffffc r0 : 00000000
Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 9e69c019 DAC: 00000051
CPU: 0 PID: 106 Comm: insmod Tainted: G O 4.9.0-dirty #1
Hardware name: Generic AM33XX (Flattened Device Tree)
Backtrace:
... pruned function calls related to kdb itself ...
[<c08326cc>] (do_page_fault) from [<c010138c>] (do_DataAbort+0x3c/0xbc)
r10:bf000240 r9:de676000 r8:de677d50 r7:00000000 r6:c08326cc r5:00000817
r4:c0d0bb2c
[<c0101350>] (do_DataAbort) from [<c0831d04>] (__dabt_svc+0x64/0xa0)
Exception stack(0xde677d50 to 0xde677d98)
7d40: 00000000 fffffffc 00000000 00000000
7d60: 00000000 bf0000b4 de6420c0 de65c7c0 00000000 219a3868 bf000240 de677dbc
7d80: 00000000 de677da4 00000000 c04c2be0 00000013 ffffffff
r8:00000000 r7:de677d84 r6:ffffffff r5:00000013 r4:c04c2be0
[<c02bf44c>] (cdev_init) from [<bf002048>] (init_module+0x48/0xb4 [drvbroken])
r5:bf002000 r4:bf000480
[<bf002000>] (init_module [drvbroken]) from [<c01018d4>] (do_one_initcall+0x44/0x180)
r5:bf002000 r4:ffffe000
[<c0101890>] (do_one_initcall) from [<c024fa2c>] (do_init_module+0x64/0x1d8)
r8:00000001 r7:de65c7c0 r6:de6420c0 r5:c0dbfa84 r4:bf000240
[<c024f9c8>] (do_init_module) from [<c01e10e8>] (load_module+0x1d6c/0x23d8)
r6:c0d0512c r5:c0dbfa84 r4:c0d4c70f
[<c01df37c>] (load_module) from [<c01e18ac>] (SyS_init_module+0x158/0x17c)
r10:00000051 r9:de676000 r8:e0a95100 r7:00000000 r6:000ac118 r5:00004100
It is pretty easy to figure out where to look for the bug with this information, but alas, it is not possible to get a line number or list the source directly from kdb. This is much easier in gdb, assuming that I can get a full backtrace.
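Until the gdb backtrace works, one workaround is to carry the `function+offset` pairs from the kdb backtrace over to gdb, whose `list *ADDRESS` form accepts an expression such as `*(init_module+0x48)`. The following is a small Python sketch that extracts those pairs. The regular expression is written against the backtrace format shown above, `gdb_list_commands` is a made-up helper name, and it assumes gdb has symbols loaded for both vmlinux and the module.

```python
import re

# Matches "(callee) from [<addr>] (caller+0xOFF/0xSIZE [module])";
# the "[module]" parts are optional.
FRAME = re.compile(
    r"\((\w+)(?: \[\w+\])?\) from \[<[0-9a-f]+>\] "
    r"\((\w+)\+(0x[0-9a-f]+)/0x[0-9a-f]+(?: \[\w+\])?\)"
)


def gdb_list_commands(backtrace):
    """Return (callee, gdb_command) pairs, one per backtrace frame."""
    commands = []
    for line in backtrace.splitlines():
        frame = FRAME.search(line)
        if frame:
            callee, caller, offset = frame.groups()
            commands.append((callee, f"list *({caller}+{offset})"))
    return commands


# Two frames copied from the kdb output in the question.
BACKTRACE = """\
[<c02bf44c>] (cdev_init) from [<bf002048>] (init_module+0x48/0xb4 [drvbroken])
[<bf002000>] (init_module [drvbroken]) from [<c01018d4>] (do_one_initcall+0x44/0x180)
"""

for callee, command in gdb_list_commands(BACKTRACE):
    print(f"{callee} was called near: {command}")
```

The offsets are return addresses, so the listed source line is the call itself or the statement just after it.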

Crash at init time of Cobalt 8.20698

The latest version of Cobalt (8.20698) crashes at init time on an ARM Linux platform; the backtrace is below. The old version doesn't have this issue. Could anyone take a look?
[00000000] *pgd=0dce6831, *pte=00000000, *ppte=00000000
CPU: 0 PID: 4268 Comm: cobalt_qa Tainted: P O 3.10.79 #2
task: cf33b400 ti: d24bc000 task.ti: d24bc000
PC is at 0xb5d12180
LR is at 0x161610
pc : [<b5d12180>] lr : [<00161610>] psr: 600f0010
sp : bed2fc20 ip : b5d12180 fp : 00000000
r10: bed30088 r9 : bed2ff78 r8 : bed2fe84
r7 : 00000002 r6 : 00000000 r5 : 00000000 r4 : 01027e68
r3 : 00000043 r2 : 00000049 r1 : 0000002e r0 : 00000000
Flags: nZCv IRQs on FIQs on Mode USER_32 ISA ARM Segment user
Control: 10c5387d Table: 124d406a DAC: 00000015
CPU: 0 PID: 4268 Comm: cobalt_qa Tainted: P O 3.10.79 #2
[<c0012c20>] (unwind_backtrace+0x0/0xdc) from [<c0010ef8>] (show_stack+0x10/0x14)
[<c0010ef8>] (show_stack+0x10/0x14) from [<c0014204>] (__do_user_fault+0x13c/0x1ac)
[<c0014204>] (__do_user_fault+0x13c/0x1ac) from [<c001449c>] (do_page_fault+0x228/0x268)
[<c001449c>] (do_page_fault+0x228/0x268) from [<c0008328>] (do_DataAbort+0x34/0x120)
[<c0008328>] (do_DataAbort+0x34/0x120) from [<c000dab4>] (__dabt_usr+0x34/0x40)
Exception stack(0xd24bdfb0 to 0xd24bdff8)
dfa0: 00000000 0000002e 00000049 00000043
dfc0: 01027e68 00000000 00000000 00000002 bed2fe84 bed2ff78 bed30088 00000000
dfe0: b5d12180 bed2fc20 00161610 b5d12180 600f0010 ffffffff
Caught signal: SIGSEGV (11)
<unknown> [0xb5d12180]
uprv_getDefaultLocaleID_56 [0x161610]
icu_56::locale_set_default_internal() [0x15a114]
icu_56::Locale::getDefault() [0x159ca0]
locale_get_default_56 [0x159cb0]
EzTimeValueExplode [0xb4d10]
EzTimeTExplode [0xb5048]
EzTimeTExplodeLocal [0xb5838]
logging::LogMessage::Init() [0x7b7cc]
logging::LogMessage::LogMessage() [0x7bcf4]
base::UserLog::IsRegistrationSupported() [0x6b108]
cobalt::browser::Application::RegisterUserLogs() [0x2c608]
cobalt::browser::Application::Application() [0x2d998]
cobalt::browser::CreateApplication() [0x2b278]
SbEventHandle [0x2b0c0]
starboard::shared::starboard::Application::DispatchStart() [0xbadec]
starboard::shared::starboard::Application::Run() [0xbb4e0]
main [0x21c24]
<unknown> [0xb5cb2278]
After tracing the Cobalt code: Cobalt needs to get the posix_id via SbSystemGetLocaleId() in system_get_locale_id.cc, but the system had not set the LANG environment variable yet, so the lookup returned null, which caused the crash. After setting the LANG environment variable (export LANG="en_US.UTF-8"), it works.
Set the LANG environment variable.
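The underlying pattern is an environment lookup whose missing value is used unchecked. The following is a minimal Python sketch of a guarded lookup. It is not Cobalt's code; the function name `posix_locale_id` and the fallback value are assumptions made for illustration.

```python
import os


def posix_locale_id(environ=os.environ, fallback="en_US.UTF-8"):
    """Return LANG, or `fallback` when LANG is unset or empty."""
    # An unset LANG is the case that produced the null value described above.
    return environ.get("LANG") or fallback


print(posix_locale_id({}))                       # LANG missing: fallback
print(posix_locale_id({"LANG": "de_DE.UTF-8"}))  # LANG present: used as-is
```

Exporting LANG before launching the process, as described above, avoids the problem without touching the code; the guard only makes the lookup tolerant of a bare environment.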

Need help to understand kernel debugging error

When I run my driver through the WHCK tests for Windows 8 (32/64-bit), it fails CHAOS in RUN TEST.
So I did kernel debugging and got the following debug message, but I don't understand where the error is in my ioctl.c file. The same driver passed the test on Windows 7 32-bit.
*** Fatal System Error: 0x0000000a
(0x00000031,0x00000002,0x00000000,0x81CB1194)
Break instruction exception - code 80000003 (first chance)
A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.
A fatal system error has occurred.
nt!RtlpBreakWithStatusInstruction:
818d6ca4 cc int 3
2: kd> !analyze -v
Connected to Windows 8 9200 x86 compatible target at (Tue May 27 11:56:02.788 2014 (UTC - 7:00)), ptr64 FALSE
Loading Kernel Symbols
...............................................................
.............................................
Press ctrl-c (cdb, kd, ntsd) or ctrl-break (windbg) to abort symbol loads that take too long.
Run !sym noisy before .reload to track down problems loading symbols.
...................
........................
Loading User Symbols
Loading unloaded module list
.........Unable to enumerate user-mode unloaded modules, Win32 error 0n30
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 00000031, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 81cb1194, address which referenced memory
Debugging Details:
------------------
READ_ADDRESS: 00000031
CURRENT_IRQL: 2
FAULTING_IP:
nt!VerifierKeSynchronizeExecution+26
81cb1194 0fb64631 movzx eax,byte ptr [esi+31h]
DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT
BUGCHECK_STR: AV
PROCESS_NAME: System
TRAP_FRAME: b74a594c -- (.trap 0xffffffffb74a594c)
ErrCode = 00000000
eax=9132b7d8 ebx=b23f4a38 ecx=b74a59d8 edx=9184c628 esi=00000000 edi=9184c570
eip=81cb1194 esp=b74a59c0 ebp=b74a59c4 iopl=0 nv up ei pl zr na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010246
nt!VerifierKeSynchronizeExecution+0x26:
81cb1194 0fb64631 movzx eax,byte ptr [esi+31h] ds:0023:00000031=??
Resetting default scope
LAST_CONTROL_TRANSFER: from 818fefc7 to 818d6ca4
STACK_TEXT:
b74a54e4 818fefc7 00000003 d565b8ac 00000031 nt!RtlpBreakWithStatusInstruction
b74a5534 818fe861 00000003 8286a340 b74a5934 nt!KiBugCheckDebugBreak+0x1c
b74a5908 818d56a6 0000000a 00000031 00000002 nt!KeBugCheck2+0x655
b74a592c 8194ed9b 0000000a 00000031 00000002 nt!KiBugCheck2+0xc6
b74a592c 81cb1194 0000000a 00000031 00000002 nt!KiTrap0E+0x1b3
b74a59c4 9132b7d8 00000000 9132ef20 b74a59d8 nt!VerifierKeSynchronizeExecution+0x26
b74a5a30 81ca1f9b 9184c570 adcf4f00 adcf4f00 OxSer!OxserInternalIoControl+0x328 [c:\users\admin\desktop\trunk\uart_v7.0\source\uart\driver\wdm\ioctl.c # 2570]
b74a5a50 81830066 81cb97fd adcf4fd4 adcf4ff8 nt!IovCallDriver+0x2e3
b74a5a64 81cb97fd b74a5a8c 81cb98f4 9184c570 nt!IofCallDriver+0x73
b74a5a6c 81cb98f4 9184c570 adcf4f00 ace85a30 nt!ViFilterIoCallDriver+0x10
b74a5a8c 81ca1f9b ace85ae8 adcf4f00 81ca27c1 nt!ViFilterDispatchGeneric+0x5e
b74a5aac 81830066 8f7eab44 ace85a30 8ad0c710 nt!IovCallDriver+0x2e3
b74a5ac0 8f7eab44 b74a5b0c b74a5b0c b74a5c14 nt!IofCallDriver+0x73
b74a5ad0 8f7ea625 001b0010 00000001 ace85a30 serenum!Serenum_IoSyncIoctlEx+0x48
b74a5c14 8f7e537d b7196ed8 b74a5c33 b7a84340 serenum!Serenum_ReenumerateDevices+0x259
b74a5c34 81866b1b b7196ed8 d565b1e8 00000000 serenum!SerenumEnumThread+0x57
b74a5c70 81950579 8f7e5326 8ad0c710 00000000 nt!PspSystemThreadStartup+0x4a
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x19
STACK_COMMAND: kb
FOLLOWUP_IP:
OxSer!OxserInternalIoControl+328 [c:\users\admin\desktop\trunk\uart_v7.0\source\uart\driver\wdm\ioctl.c # 2570]
9132b7d8 8b4dcc mov ecx,dword ptr [ebp-34h]
FAULTING_SOURCE_LINE: c:\users\admin\desktop\trunk\uart_v7.0\source\uart\driver\wdm\ioctl.c
FAULTING_SOURCE_FILE: c:\users\admin\desktop\trunk\uart_v7.0\source\uart\driver\wdm\ioctl.c
FAULTING_SOURCE_LINE_NUMBER: 2570
FAULTING_SOURCE_CODE:
No source found for 'c:\users\admin\desktop\trunk\uart_v7.0\source\uart\driver\wdm\ioctl.c'
SYMBOL_STACK_INDEX: 6
SYMBOL_NAME: OxSer!OxserInternalIoControl+328
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: OxSer
IMAGE_NAME: OxSer.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 53802c0f
BUCKET_ID_FUNC_OFFSET: 328
FAILURE_BUCKET_ID: AV_VRF_OxSer!OxserInternalIoControl
BUCKET_ID: AV_VRF_OxSer!OxserInternalIoControl
Followup: MachineOwner
---------
The routine that crashed was actually in the OS verifier. This is a set of functions that perform additional validation on driver calls during driver development, in order to find driver bugs.
You are probably not crashing on Win7 because either the verifier is not turned on or the verifier was not detecting this problem in Win7. While your code is not crashing, it is probably still doing something that will cause OS instability at some point.
You should view this as Win8 helping you identify a real bug much more easily, rather than under weird circumstances after you shipped your driver.
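The four bugcheck arguments can be read mechanically using the field descriptions printed in the `!analyze -v` output above (Arg3 bit 0 is read/write, bit 3 is execute). The following is a short Python sketch; `describe_irql_bugcheck` is a made-up helper, not part of any debugger API.

```python
def describe_irql_bugcheck(arg1, arg2, arg3, arg4):
    """Summarise the arguments of an IRQL_NOT_LESS_OR_EQUAL (0xA) bugcheck."""
    operation = "write" if arg3 & 0x1 else "read"   # bit 0: 0 = read, 1 = write
    if arg3 & 0x8:                                  # bit 3: execute operation
        operation = "execute"
    return (f"{operation} of address {arg1:#010x} at IRQL {arg2} "
            f"by instruction at {arg4:#010x}")


# Arguments copied from the bugcheck in the question.
print(describe_irql_bugcheck(0x00000031, 0x00000002, 0x00000000, 0x81CB1194))
```

The referenced address 0x31 matches the faulting instruction `movzx eax,byte ptr [esi+31h]` with `esi=00000000`, so the verifier routine appears to have been handed a null pointer by its caller.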

What caused my NDIS miniport driver to crash on Windows XP?

I wrote a simple packet filter driver based on the 'passthru' example of the Windows DDK. When I turned on the filter function, the OS crashed and I got the following message from WinDbg:
Microsoft (R) Windows Debugger Version 6.12.0002.633 X86
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [D:\iCheckTool\dump\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available
WARNING: Whitespace at start of path element
Symbol search path is: D:\iCheckTool\dump;
SRV*E:\DebuggingSymbols*http://msdl.microsoft.com/download/symbols;SRV*C:\MyLocalSymbols*http://192.168.20.25/zfprisymbols/
Executable search path is:
Windows XP Kernel Version 2600 (Service Pack 3) MP (2 procs) Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 2600.xpsp_sp3_qfe.120504-1617
Machine Name:
Kernel base = 0x804d8000 PsLoadedModuleList = 0x8055e720
Debug session time: Tue Sep 11 09:41:02.828 2012 (UTC + 8:00)
System Uptime: 0 days 0:02:30.578
Loading Kernel Symbols
...............................................................
.............................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 7ffd800c). Type ".hh dbgerr001" for details
Loading unloaded module list
........
*
Bugcheck Analysis *
*
Use !analyze -v to get detailed debugging information.
BugCheck C5, {4, 2, 1, 8054c10f}
Probably caused by : Pool_Corruption ( nt!ExDeferredFreePool+109 )
Followup: Pool_corruption
1: kd> !analyze -v
*
Bugcheck Analysis *
*
DRIVER_CORRUPTED_EXPOOL (c5)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is
caused by drivers that have corrupted the system pool. Run the driver
verifier against any new (or suspect) drivers, and if that doesn't turn up
the culprit, then use gflags to enable special pool.
Arguments:
Arg1: 00000004, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000001, value 0 = read operation, 1 = write operation
Arg4: 8054c10f, address which referenced memory
Debugging Details:
BUGCHECK_STR: 0xC5_2
CURRENT_IRQL: 2
FAULTING_IP:
nt!ExDeferredFreePool+109
8054c10f 895f04 mov dword ptr [edi+4],ebx
DEFAULT_BUCKET_ID: DRIVER_FAULT
PROCESS_NAME: explorer.exe
TRAP_FRAME: b42555dc -- (.trap 0xffffffffb42555dc)
ErrCode = 00000002
eax=89cc1c60 ebx=89e4ded8 ecx=000001ff edx=89cc2a78 esi=80565d20 edi=00000000
eip=8054c10f esp=b4255650 ebp=b4255690 iopl=0 nv up ei ng nz ac pe cy
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010297
nt!ExDeferredFreePool+0x109:
8054c10f 895f04 mov dword ptr [edi+4],ebx ds:0023:00000004=????????
Resetting default scope
LOCK_ADDRESS: 8055c4e0 -- (!locks 8055c4e0)
Resource # nt!PiEngineLock (0x8055c4e0) Available
Contention Count = 1
1 total locks
PNP_TRIAGE:
Lock address : 0x8055c4e0
Thread Count : 0
Thread address: 0x00000000
Thread wait : 0x0
LAST_CONTROL_TRANSFER: from 8054c10f to 80545768
STACK_TEXT:
b42555dc 8054c10f badb0d00 89cc2a78 b8338538 nt!KiTrap0E+0x238
b4255690 8054c75f 00000001 8055c100 00020019 nt!ExDeferredFreePool+0x109
b42556d0 8058635e 899522e8 00000000 b42557d8 nt!ExFreePoolWithTag+0x47f
b42556fc 805878b8 c0000023 00000007 8058758c nt!PiGetDeviceRegistryProperty+0x108
b425578c bf879f40 8a523030 00000001 00000100 nt!IoGetDeviceProperty+0x25e
b42558f8 bf879735 00000000 e1b5e008 00000000 win32k!DrvEnumDisplayDevices+0x33b
b425591c 8054268c 00000000 00000000 0007ecc4 win32k!NtUserEnumDisplayDevices+0x7c
b425591c 7c92e514 00000000 00000000 0007ecc4 nt!KiFastCallEntry+0xfc
WARNING: Frame IP not in any known module. Following frames may be wrong.
0007f010 00000000 00000000 00000000 00000000 0x7c92e514
STACK_COMMAND: kb
FOLLOWUP_IP:
nt!ExDeferredFreePool+109
8054c10f 895f04 mov dword ptr [edi+4],ebx
SYMBOL_STACK_INDEX: 1
SYMBOL_NAME: nt!ExDeferredFreePool+109
FOLLOWUP_NAME: Pool_corruption
IMAGE_NAME: Pool_Corruption
DEBUG_FLR_IMAGE_TIMESTAMP: 0
MODULE_NAME: Pool_Corruption
FAILURE_BUCKET_ID: 0xC5_2_nt!ExDeferredFreePool+109
BUCKET_ID: 0xC5_2_nt!ExDeferredFreePool+109
Followup: Pool_corruption
Can someone tell me what caused this problem and how to fix it?
Thanks.
Apparently, something tried to write to an invalid memory region (address = 0x4). Beyond this, the debugger analysis you posted isn't very helpful. You can try to find your driver's stack in the debugger (it is not present in the output you posted) to get to the failing code, but that is not guaranteed to work. Other ways to attack this include adding debug prints to your code and capturing them with DbgView (you can later extract them from the memory dump). You can also connect a kernel debugger and catch the error when it happens.
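The 0x4 address follows directly from the trap frame: the faulting instruction is `mov dword ptr [edi+4],ebx` and `edi` is zero. The following is a small Python sketch that recomputes it from the register line in the question; `parse_registers` is a hypothetical helper written for this example.

```python
import re


def parse_registers(dump):
    """Parse 'eax=89cc1c60 ebx=...' style text into a {name: int} dict."""
    pairs = re.findall(r"\b(e[a-z]{2})=([0-9a-f]{8})\b", dump)
    return {name: int(value, 16) for name, value in pairs}


# Register line copied from the TRAP_FRAME in the question.
REGISTERS = ("eax=89cc1c60 ebx=89e4ded8 ecx=000001ff "
             "edx=89cc2a78 esi=80565d20 edi=00000000")

regs = parse_registers(REGISTERS)
target = regs["edi"] + 4  # effective address of: mov dword ptr [edi+4],ebx
print(f"write target = {target:#x}")
```

A null `edi` inside `nt!ExDeferredFreePool` is consistent with the pool-corruption diagnosis: the pool free path followed a pointer that had been overwritten, most likely by an earlier out-of-bounds write somewhere else.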
