Understanding kernel oops error code - debugging

When the kernel oopses on ARM, the following is printed in the kernel log:
<1>[ 4205.112835] I[0:swapper/0:0] [c0] Unable to handle kernel paging request at virtual address ff898580
<1>[ 4205.112874] I[0:swapper/0:0] [c0] pgd = ec3c4000
<1>[ 4205.112901] I[0:swapper/0:0] [c0] [ff898580] *pgd=00000000
<0>[ 4205.112939] I[0:swapper/0:0] [c0] Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM
Sometimes the oops code is:
Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP ARM
and in most of the logs it is:
Internal error: Oops: 5 [#1] PREEMPT SMP ARM
Can someone explain the purpose of this code and its meaning?

You have provided very little information.
In arch/arm/kernel/traps.c you will find:
printk(KERN_EMERG "Internal error: %s: %x [#%d]" S_PREEMPT S_SMP S_ISA "\n", str, err, ++die_counter);
The complete stack trace would be much more helpful: it shows where the bug is, and by disassembling around that address you can find the exact place in the code.
Just guessing: you dereferenced a NULL or otherwise invalid pointer.
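To make the %x value in that printk less opaque, here is a minimal sketch that splits it into fields. It assumes the value is the ARMv7-A short-descriptor fault status register (FSR) as the fault handler passes it on, with bit 31 being Linux's own marker for a prefetch (instruction fetch) abort; the table is partial, the helper name decode_fsr is made up for illustration, and kernels built with LPAE use a different layout. The "undefined instruction: 0" case appears not to be an FSR value at all.

```python
# Sketch: decode the hex value printed after "Oops:" on 32-bit ARM.
# Assumptions: short-descriptor FSR layout (ARMv7-A, no LPAE) and Linux's
# private prefetch-abort marker in bit 31. The table below is partial.

FSR_LNX_PF = 1 << 31   # set by Linux when the abort was an instruction fetch
FSR_WRITE = 1 << 11    # WnR: the faulting access was a write
FSR_FS4 = 1 << 10      # bit 4 of the fault status
FSR_FS3_0 = 0xF        # bits 3..0 of the fault status

FAULT_STATUS = {
    0x01: "alignment fault",
    0x05: "translation fault (section)",
    0x07: "translation fault (page)",
    0x08: "synchronous external abort",
    0x09: "domain fault (section)",
    0x0B: "domain fault (page)",
    0x0D: "permission fault (section)",
    0x0F: "permission fault (page)",
    0x16: "asynchronous (imprecise) external abort",
}


def decode_fsr(err):
    """Return the access type and fault status encoded in an oops code."""
    # The 5-bit fault status is split: bits 3..0 plus bit 10 as bit 4.
    fs = (err & FSR_FS3_0) | ((err & FSR_FS4) >> 6)
    if err & FSR_LNX_PF:
        access = "instruction fetch"
    elif err & FSR_WRITE:
        access = "write"
    else:
        access = "read"
    return {
        "access": access,
        "fault_status": fs,
        "meaning": FAULT_STATUS.get(fs, "not in this partial table"),
    }


if __name__ == "__main__":
    for code in (0x80000005, 0x5):
        print(hex(code), decode_fsr(code))
```

Under these assumptions both 80000005 and 5 are translation faults on an unmapped section, which fits the *pgd=00000000 line in the log; the first was raised by an instruction fetch and the second by a data read.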

Related

unhandled level 3 permission fault (11) [ error disappears when running gdb...! ]

I am trying to run my C++ application on an aarch64 (ARMv8) machine. When run under GDB, the application runs without any problem, but otherwise it gives a segmentation fault. I checked dmesg and it shows:
unhandled level 3 permission fault (11) at 0x004ac010, esr 0x8300000f
[241808.064733] pgd = ffffffc0fe270000
[241808.068270] [004ac010] *pgd=00000001615c9003, *pmd=000000016f316003, *pte=02e0000147f42f53
[241808.076813]
[241808.076824] CPU: 2 PID: 12503 Comm: Jumpi Not tainted 3.10.67-g3a5c467 #1
[241808.076832] task: ffffffc0fef9c080 ti: ffffffc0f0fe4000 task.ti: ffffffc0f0fe4000
[241808.076841] PC is at 0x4ac010
[241808.076846] LR is at 0x401cb8
[241808.076852] pc : [<00000000004ac010>] lr : [<0000000000401cb8>] pstate: 20000000
[241808.076857] sp : 0000007fc044b600
[241808.076863] x29: 0000007fc044b680 x28: 0000000000000000
[241808.076873] x27: 0000000000000000 x26: 0000000000000000
[241808.076882] x25: 00000000004186ec x24: 0000000000418634
I tried set disable-randomization off in gdb but still got no error. I then tried valgrind. I get a lot of messages saying that an uninitialised value was created, mostly at dl_init_paths. More importantly, I get a bad-permission SIGSEGV at a memory address which, when I inspected memory, seems to be in env_path_list.
That is where I am after debugging for hours. If anyone has suggestions or ideas about next steps, that would be helpful.
Another interesting fact: when the same code is compiled with a cross compiler and run on this ARMv8 machine, it works fine.
You can find the detailed reason for the fault in the 'esr' register, which is already printed in the crash dump. Use the ARMv8 specification to decode the value of 'esr'.
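As a rough illustration of that decoding, the sketch below splits an ESR value into its exception class (EC, bits 31..26), instruction length bit (IL, bit 25) and syndrome (ISS, bits 24..0), then reads the fault status code from the low six ISS bits. The field layout is my reading of the ARMv8-A ESR_ELx description; the tables are deliberately partial and the function name decode_esr is hypothetical.

```python
# Sketch: split an AArch64 ESR value into EC / IL / ISS and name the fault.
# Assumption: ESR_ELx layout from the ARMv8-A architecture manual.
# Both tables are partial; the fault status only applies to abort classes.

EXCEPTION_CLASS = {
    0x20: "instruction abort from a lower exception level (user space)",
    0x21: "instruction abort from the current exception level",
    0x24: "data abort from a lower exception level (user space)",
    0x25: "data abort from the current exception level",
}

# Upper four bits of the 6-bit fault status code; the low two bits are the
# translation table level for these four kinds of fault.
FAULT_KIND = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Return the main fields of an ESR value as a dictionary."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    fsc = iss & 0x3F
    kind = FAULT_KIND.get(fsc >> 2)
    if kind is not None:
        fault = "%s, level %d" % (kind, fsc & 0x3)
    else:
        fault = "not in this partial table"
    return {
        "ec": ec,
        "ec_meaning": EXCEPTION_CLASS.get(ec, "not in this partial table"),
        "il": il,
        "iss": iss,
        "fault": fault,
    }


if __name__ == "__main__":
    print(decode_esr(0x8300000F))
```

For esr 0x8300000f this gives EC 0x20 (an instruction abort from user space) with a level 3 permission fault. Since the dump also shows the PC equal to the faulting address 0x4ac010, one plausible reading is that the process tried to execute code from a page that is not mapped executable.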

What does "kernel tainted" mean?

My OS is Fedora 17. Recently, the kernel tainted warning "kernel bug at kernel/auditsc.c:1772!-abrt" has been occurring:
This problem should not be reported (it is likely a known problem). A kernel problem occurred, but your kernel has been tainted (flags:GD). Kernel maintainers are unable to diagnose tainted reports.
Then, I get the following:
# cat /proc/sys/kernel/tainted
128
# dmesg | grep -i taint
[ 8306.955523] Pid: 4511, comm: chrome Tainted: G D 3.9.10-100.fc17.i686.PAE #1 Dell Inc.
[ 8307.366310] Pid: 4571, comm: chrome Tainted: G D 3.9.10-100.fc17.i686.PAE #1 Dell Inc.
It seems that the value "128" is quite serious:
128 – The system has died.
What does this warning mean? Since chrome is flagged as the "Tainted" source, has anybody else run into this?
To (over) simplify, 'tainted' means that the kernel is in a state other than what it would be in if it were built fresh from the open source origin and used in a way that it had been intended. It is a way of flagging a kernel to warn people (e.g., developers) that there may be unknown reasons for it to be unreliable, and that debugging it may be difficult or impossible.
In this case, 'GD' means that all modules are licensed as GPL or compatible (ie not proprietary), and that a crash or BUG() occurred.
The reasons are listed below:
See: oops-tracing.txt
---------------------------------------------------------------------------
Tainted kernels:
Some oops reports contain the string 'Tainted: ' after the program
counter. This indicates that the kernel has been tainted by some
mechanism. The string is followed by a series of position-sensitive
characters, each representing a particular tainted value.
1: 'G' if all modules loaded have a GPL or compatible license, 'P' if
any proprietary module has been loaded. Modules without a
MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by
insmod as GPL compatible are assumed to be proprietary.
2: 'F' if any module was force loaded by "insmod -f", ' ' if all
modules were loaded normally.
3: 'S' if the oops occurred on an SMP kernel running on hardware that
hasn't been certified as safe to run multiprocessor.
Currently this occurs only on various Athlons that are not
SMP capable.
4: 'R' if a module was force unloaded by "rmmod -f", ' ' if all
modules were unloaded normally.
5: 'M' if any processor has reported a Machine Check Exception,
' ' if no Machine Check Exceptions have occurred.
6: 'B' if a page-release function has found a bad page reference or
some unexpected page flags.
7: 'U' if a user or user application specifically requested that the
Tainted flag be set, ' ' otherwise.
8: 'D' if the kernel has died recently, i.e. there was an OOPS or BUG.
9: 'A' if the ACPI table has been overridden.
10: 'W' if a warning has previously been issued by the kernel.
(Though some warnings may set more specific taint flags.)
11: 'C' if a staging driver has been loaded.
12: 'I' if the kernel is working around a severe bug in the platform
firmware (BIOS or similar).
13: 'O' if an externally-built ("out-of-tree") module has been loaded.
14: 'E' if an unsigned module has been loaded in a kernel supporting
module signature.
15: 'L' if a soft lockup has previously occurred on the system.
16: 'K' if the kernel has been live patched.
The primary reason for the 'Tainted: ' string is to tell kernel
debuggers if this is a clean kernel or if anything unusual has
occurred. Tainting is permanent: even if an offending module is
unloaded, the tainted value remains to indicate that the kernel is not
trustworthy.
The numeric values found in the /proc/sys/kernel/tainted file are documented as follows:
Non-zero if the kernel has been tainted. Numeric values, which can be
ORed together. The letters are seen in "Tainted" line of Oops reports.
1 (P): A module with a non-GPL license has been loaded, this
includes modules with no license.
Set by modutils >= 2.4.9 and module-init-tools.
2 (F): A module was force loaded by insmod -f.
Set by modutils >= 2.4.9 and module-init-tools.
4 (S): Unsafe SMP processors: SMP with CPUs not designed for SMP.
8 (R): A module was forcibly unloaded from the system by rmmod -f.
16 (M): A hardware machine check error occurred on the system.
32 (B): A bad page was discovered on the system.
64 (U): The user has asked that the system be marked "tainted". This
could be because they are running software that directly modifies
the hardware, or for other reasons.
128 (D): The system has died.
256 (A): The ACPI DSDT has been overridden with one supplied by the user
instead of using the one provided by the hardware.
512 (W): A kernel warning has occurred.
1024 (C): A module from drivers/staging was loaded.
2048 (I): The system is working around a severe firmware bug.
4096 (O): An out-of-tree module has been loaded.
8192 (E): An unsigned module has been loaded in a kernel supporting module
signature.
16384 (L): A soft lockup has previously occurred on the system.
32768 (K): The kernel has been live patched.
65536 (X): Auxiliary taint, defined for and used by distros.
131072 (T): The kernel was built with the struct randomization plugin.
Source: https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
Credit: https://askubuntu.com/questions/248470/what-does-the-kernel-taint-value-mean
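Because the values can be ORed together, it can help to decode the number mechanically. The following is a small sketch that maps a value read from /proc/sys/kernel/tainted back to the letters listed above; the table is copied from that list, and the function name decode_taint is just an illustrative choice.

```python
# Sketch: turn the bitmask from /proc/sys/kernel/tainted into taint letters.
# The (value, letter) pairs are taken from the list quoted above.

TAINT_FLAGS = [
    (1, "P"), (2, "F"), (4, "S"), (8, "R"),
    (16, "M"), (32, "B"), (64, "U"), (128, "D"),
    (256, "A"), (512, "W"), (1024, "C"), (2048, "I"),
    (4096, "O"), (8192, "E"), (16384, "L"), (32768, "K"),
    (65536, "X"), (131072, "T"),
]


def decode_taint(value):
    """Return the taint letters whose bits are set in value."""
    return [letter for bit, letter in TAINT_FLAGS if value & bit]


if __name__ == "__main__":
    # 128 is the value reported in the question.
    print(decode_taint(128))
```

For the value 128 from the question only 'D' is set. The 'G' in "Tainted: G D" is not a bit of its own; it is what the oops line shows when the 'P' bit is clear.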

Kernel not booting in armada 370 board

I am using a customized Armada 370 board based on ARMv7.
I am able to load U-Boot successfully. But when I load the Linux kernel directly into DRAM with the "loadb" command, I get the error below.
Error:
########################################
[    0.400000] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
[    0.400000] Internal error: : 1406 [#1] PREEMPT
[    0.400000] last sysfs file:
[    0.400000] Modules linked in:
[    0.400000] CPU: 0    Not tainted  (2.6.34.10-WR4.3.0.0_standard #73)
[    0.400000] PC is at trace_hardirqs_on+0x0/0x10
[    0.400000] LR is at kernel_thread_helper+0x4/0x14
########################################
Below are the specifications the board is running at.
CPU freq - 1000MHz
DDR & L2 cache freq - 667MHz
I am using DDR3 SDRAM
I am using the Linux kernel 2.6.34 Marvell Armada 370 package from Wind River Linux.
I tried booting the same kernel image on the Marvell reference board and it works fine.
I read in an article that such errors are related to RAM.
But in U-Boot I am able to perform read and write operations successfully.
I analysed the log and found that the value 0x1406 is the content of the Data Fault Status Register.
Using this article, I decoded the value, and the error points to an AXI slave read error.
Can you help me understand why I am getting this error?
Thanks in advance.
Thanks & Regards
Shamshad

Booting Linux kernel in AT91SAM9260

I am trying to understand the build and boot process of the Linux kernel for ARM. I took vanilla Linux from www.kernel.org and built it after configuring it for AT91SAM9260.
The build output showed:
==========================================
LD vmlinux
SORTEX vmlinux
SYSMAP System.map
OBJCOPY arch/arm/boot/Image
Kernel: arch/arm/boot/Image is ready
GZIP arch/arm/boot/compressed/piggy.gzip
AS arch/arm/boot/compressed/piggy.gzip.o
LD arch/arm/boot/compressed/vmlinux
OBJCOPY arch/arm/boot/zImage
Kernel: arch/arm/boot/zImage is ready
UIMAGE arch/arm/boot/uImage
Image Name: Linux-3.9.1+
Created: Sat Nov 23 18:15:58 2013
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 1635544 Bytes = 1597.21 kB = 1.56 MB
Load Address: 20008000
Entry Point: 20008000
Image arch/arm/boot/uImage is ready
==========================================
My questions are:
The image type is "uncompressed". Does this mean that vmlinux is not compressed into a zImage?
Load Address: 20008000: is this the address of the decompressed image, i.e. ZRELADDR as defined in arch/arm/boot/Makefile?
Is this also the address of ../arm/kernel/head.o?
It seems that the address KERNEL_PHYS is not used. Is this the common approach, or is it specific to the AT91SAM family?
Basically, our build and boot procedures are:
a. Kernel build steps: vmlinux -> uImage (skipping the creation of zImage).
b. Kernel boot steps: DataFlash/NAND --load--> uImage (at 0x22200000) --decompress--> uncompressed image (at 0x20008000).
In this case there is no zImage in the boot process, although the build output shows a zImage being created. Am I wrong?
4. What about the address 0xC0008000, which I found in arch/arm/kernel/vmlinux.lds at the line:
. = 0xC0000000 + 0x00008000;
Do we use it? I am confusing this address with ZRELADDR.
Regards.
The uImage file has most probably been built using the zImage. It says uncompressed because the uImage itself is not compressed.
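To check what the "uncompressed" label refers to, you can read the 64-byte header that mkimage puts in front of the payload. The sketch below follows my understanding of U-Boot's legacy image header (big-endian fields, magic 0x27051956, a one-byte compression field where 0 means none); the function name and the dictionary keys are made up for illustration.

```python
# Sketch: parse the 64-byte legacy uImage header written by mkimage.
# Assumption: U-Boot legacy image layout (seven 32-bit words, four bytes
# for os/arch/type/compression, then a 32-byte name), all big-endian.
import struct

UIMAGE_MAGIC = 0x27051956
HEADER = struct.Struct(">7I4B32s")   # 7*4 + 4 + 32 = 64 bytes
COMPRESSION = {0: "none", 1: "gzip", 2: "bzip2"}


def parse_uimage_header(data):
    """Decode the first 64 bytes of a legacy uImage file."""
    (magic, _hcrc, _time, size, load, entry, _dcrc,
     _os, _arch, _type, comp, name) = HEADER.unpack_from(data)
    if magic != UIMAGE_MAGIC:
        raise ValueError("not a legacy uImage header")
    return {
        "name": name.rstrip(b"\0").decode("ascii", "replace"),
        "data_size": size,
        "load_address": load,
        "entry_point": entry,
        # Describes the uImage payload as a whole, not what is inside it.
        "compression": COMPRESSION.get(comp, "other (%d)" % comp),
    }

# Possible use (path taken from the build output in the question):
#   with open("arch/arm/boot/uImage", "rb") as f:
#       print(parse_uimage_header(f.read(64)))
```

The compression field only says whether the bootloader has to decompress the payload. A payload that is itself a zImage still carries its own decompressor, so the header can say "uncompressed" while the kernel inside is compressed.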
The load address can be used by the boot-loader to store data necessary for the early phases of the Linux kernel boot (such as the command line defined in the bootloader.)
You're right about the boot process. But when the zImage is used then the decompression is done by the kernel instead of the bootloader. See decompress_kernel()
The address 0xc0008000 is a virtual address. It maps to the physical address 0x20008000. Virtual addresses can be used only after Linux sets up the memory translation (MMU).
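The relationship between the two addresses can be written down as simple arithmetic. This is only a sketch of the linear kernel mapping: PAGE_OFFSET is taken from the vmlinux.lds line in the question, while PHYS_OFFSET (the start of RAM) is an assumption derived from the load address 0x20008000 minus the 0x8000 text offset.

```python
# Sketch: linear mapping between kernel virtual and physical addresses.
# PAGE_OFFSET comes from the linker script line quoted in the question.
# PHYS_OFFSET is an assumption: load address 0x20008000 minus 0x8000.

PAGE_OFFSET = 0xC0000000   # start of the kernel's linear virtual mapping
PHYS_OFFSET = 0x20000000   # assumed physical start of RAM on this board
TEXT_OFFSET = 0x00008000   # offset of the kernel image from the start of RAM


def virt_to_phys(vaddr):
    """Translate a linearly mapped kernel virtual address to physical."""
    return vaddr - PAGE_OFFSET + PHYS_OFFSET


def phys_to_virt(paddr):
    """Translate a physical RAM address to its kernel virtual address."""
    return paddr - PHYS_OFFSET + PAGE_OFFSET


if __name__ == "__main__":
    link_address = PAGE_OFFSET + TEXT_OFFSET
    print(hex(link_address), "->", hex(virt_to_phys(link_address)))
```

So 0xC0008000 is the address the kernel is linked to run at once the MMU is on, and 0x20008000 is where the same bytes sit in physical memory before that.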

WinDbg not showing useful information

First let me say I am a total WinDbg noob, so this might be an easy question...
I have an application ("MyApp" - name changed to protect the innocent!) that I am trying to debug because it is throwing an exception. This only happens on user machines; I have not been able to reproduce it on my development machine. So I set up DebugDiag on the user's machine and captured a full dump. Then I loaded the dump in WinDbg and ran !analyze -v and kp to try to figure out what was going on, but neither of these seems to give me the information I'm looking for: the function (and hopefully the line number) of the line that is causing the problem. I think I have the symbol file loaded by specifying the path to 'MyApp.pdb' in the Symbol File Path:
srv*c:\symcache*http://msdl.microsoft.com/download/symbols;srv*c:\symcache*C:\dev\Customer\MyAppSln\MyApp\Debug
First, here's the output from kp:
0:004> kp
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
0502f474 7c347966 MyApp!DllMain+0x3e8a6
0502f4bc 7c3a2448 msvcr71!_nh_malloc(unsigned int size = <Memory access error>, int nhFlag = <Memory access error>)+0x24 [f:\vs70builds\3052\vc\crtbld\crt\src\malloc.c # 117]
0502f57c 7c3416b3 msvcp71!std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >::_Tidy(bool _Built = <Memory access error>, unsigned int _Newsize = <Memory access error>)+0x45 [f:\vs70builds\3077\vc\crtbld\crt\src\xstring # 1520]
0502f610 7c3a32de msvcr71!_heap_alloc(unsigned int size = <Memory access error>)+0xe0 [f:\vs70builds\3052\vc\crtbld\crt\src\malloc.c # 212]
0502f620 7c3b3f63 msvcp71!wmemcpy(wchar_t * _S1 = 0x04e463b9 "Ҹ???", wchar_t * _S2 = 0xffffffff "--- memory read error at address 0xffffffff ---", unsigned int _N = 0x4e25212)+0x14 [f:\vs70builds\3077\vc\crtbld\crt\src\wchar.h # 843]
0502f640 04e463b9 msvcp71!std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >::assign(class std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > * _Right = 0xffffffff, unsigned int _Roff = 0x4e25212, unsigned int _Count = 2)+0x7c [f:\vs70builds\3077\vc\crtbld\crt\src\xstring # 601]
0502f770 04df1077 MyApp!DllMain+0x65329
0502f824 04e01b35 MyApp!DllMain+0xffe7
0502ff08 04dfe034 MyApp!DllMain+0x20aa5
0502ff48 04dfde4f MyApp!DllMain+0x1cfa4
0502ff88 7648d0e9 MyApp!DllMain+0x1cdbf
0502ffc4 773499f9 kernel32!BaseThreadInitThunk+0xe
0502ffd4 7738198e ntdll!RtlQueryInformationAcl+0x8b
0502ffec 00000000 ntdll!_RtlUserThreadStart+0x1b
The line I'm specifically trying to decode is 'MyApp!DllMain+0x65329', as this is the last line of my code that seems to be executing, and the error is occurring within the malloc call, which is apparently where the exception is being thrown from. What am I doing wrong that makes it display only the module and offset instead of the source file and line number?
I'm also not sure why the line above the malloc call is back in MyApp again; maybe someone can explain that too.
Just in case, here's the output from '!analyze -v':
0:004> !analyze -v
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
*** WARNING: Unable to verify checksum for MyApp.exe
*** ERROR: Module load completed but symbols could not be loaded for MyApp.exe
*** WARNING: Unable to verify checksum for ThirdPartyDll.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for ThirdPartyDll.dll -
*** WARNING: Unable to verify checksum for mdnsNSP.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for mdnsNSP.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for SLC.dll -
FAULTING_IP:
MyApp!DllMain+3e8a6
04e1f936 8b16 mov edx,dword ptr [esi]
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 04e1f936 (MyApp!DllMain+0x0003e8a6)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 00000000
Attempt to read from address 00000000
PROCESS_NAME: MyApp.exe
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 00000000
READ_ADDRESS: 00000000
FOLLOWUP_IP:
msvcr71!_heap_alloc+e0 [f:\vs70builds\3052\vc\crtbld\crt\src\malloc.c # 212]
7c3416b3 e88e0c0000 call msvcr71!__SEH_epilog (7c342346)
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
LAST_CONTROL_TRANSFER: from 00000000 to 773bbb33
FAULTING_THREAD: ffffffff
BUGCHECK_STR: APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_NULL_POINTER_READ_SHUTDOWN
PRIMARY_PROBLEM_CLASS: ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_SHUTDOWN
DEFAULT_BUCKET_ID: ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_SHUTDOWN
STACK_TEXT:
773bbb33 ntdll!RtlpAllocateHeap+0x7ad
773a6e0c ntdll!RtlAllocateHeap+0x1e3
7c3416b3 msvcr71!_heap_alloc+0xe0
FAULTING_SOURCE_CODE:
No source found for 'f:\vs70builds\3052\vc\crtbld\crt\src\malloc.c'
SYMBOL_STACK_INDEX: 2
SYMBOL_NAME: msvcr71!_heap_alloc+e0
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: msvcr71
IMAGE_NAME: msvcr71.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 3e561eac
STACK_COMMAND: dds 7740c078 ; kb
FAILURE_BUCKET_ID: ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_SHUTDOWN_c0000005_msvcr71.dll!_heap_alloc
BUCKET_ID: APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_NULL_POINTER_READ_SHUTDOWN_msvcr71!_heap_alloc+e0
If you believe the PDB should be in your symbol path, you should run something like this:
!sym noisy
.reload MyApp.dll
kp
!sym noisy causes the debugger to print more detailed information on why it could not load symbols: no MyApp.pdb found, found but does not match, and so on. This will help you find out why it is not loading symbols. !sym quiet turns the verbose symbol output off again.
When you set the path for symbols, did you reload them?
.reload
I'm not sure your adding
srv*c:\symcache*C:\dev\Customer\MyAppSln\MyApp\Debug
to the symbol path has the desired effect.
I usually list all local paths in the .sympath first, and as the last step I use .symfix+ to add the public symbols from the Microsoft symbol server:
.sympath C:\dev\Customer\MyAppSln\MyApp\Debug
.symfix+ c:\symcache
The rationale for listing local paths first is that the debugger then does not have to check the remote server for PDBs (which are not there anyway) when it can simply retrieve them locally.
In any case, your problem is that the symbols for MyApp are not loaded, and therefore stack walking does not quite work.
The debugger walks the stack backwards, starting from the top. That is why you are seeing MyApp there: it is where the access violation occurred.
Since the debugger does not have the symbols at this point, it can only guess what invocation chain led to the function on top.
And it guesses wrong by following a misleading path.
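A quick arithmetic check on the numbers in the question is consistent with the missing-symbols explanation. The sketch below subtracts each reported offset from the corresponding address; the address/offset pairs are copied from the kp and !analyze -v output, and the function name symbol_bases is hypothetical.

```python
# Sketch: check whether all "MyApp!DllMain+offset" labels point back to
# one and the same symbol address. Pairs are copied from the question.

FRAMES = [
    (0x04E1F936, 0x3E8A6),   # FAULTING_IP: MyApp!DllMain+3e8a6
    (0x04E463B9, 0x65329),   # return address into MyApp!DllMain+0x65329
    (0x04DF1077, 0x0FFE7),   # return address into MyApp!DllMain+0xffe7
]


def symbol_bases(frames):
    """Return the set of symbol addresses implied by (address, offset)."""
    return {address - offset for address, offset in frames}


if __name__ == "__main__":
    print([hex(base) for base in sorted(symbol_bases(FRAMES))])
```

Every pair resolves to the same address, so offsets as large as 0x65329 are most likely not inside DllMain at all. They are distances from the only MyApp symbol the debugger has, which is what you would expect when the PDB has not been loaded.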

Resources