I am trying to run my C++ application on an aarch64 (ARMv8) board. ***When run under GDB, the application runs without any problem, but otherwise it gives me a segmentation fault.*** I checked dmesg, and it shows:
unhandled level 3 permission fault (11) at 0x004ac010, esr 0x8300000f
[241808.064733] pgd = ffffffc0fe270000
[241808.068270] [004ac010] *pgd=00000001615c9003, *pmd=000000016f316003, *pte=02e0000147f42f53
[241808.076813]
[241808.076824] CPU: 2 PID: 12503 Comm: Jumpi Not tainted 3.10.67-g3a5c467 #1
[241808.076832] task: ffffffc0fef9c080 ti: ffffffc0f0fe4000 task.ti: ffffffc0f0fe4000
[241808.076841] PC is at 0x4ac010
[241808.076846] LR is at 0x401cb8
[241808.076852] pc : [<00000000004ac010>] lr : [<0000000000401cb8>] pstate: 20000000
[241808.076857] sp : 0000007fc044b600
[241808.076863] x29: 0000007fc044b680 x28: 0000000000000000
[241808.076873] x27: 0000000000000000 x26: 0000000000000000
[241808.076882] x25: 00000000004186ec x24: 0000000000418634
I tried set disable-randomization off in GDB, but still saw no error. I then tried valgrind. I get a lot of error messages saying that an uninitialised value was created, mostly at dl_init_paths. More importantly, I get a bad-permissions error generating SIGSEGV at a memory address which, when I looked through memory, seems to be in env_path_list.
That is where I am after debugging for hours. If anyone has suggestions or ideas about the next steps, that would be helpful.
Another interesting fact: when the same code is compiled with a cross compiler and run on this ARMv8 board, it works fine.
You can find the detailed reason for the fault in the 'esr' register value, which is already printed in the crash dump. Use the ARMv8 specification to decode the value of the 'esr' register.
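As a rough illustration of that decoding, here is a minimal Python sketch that splits an ESR value into the exception class (EC, bits 31:26), the instruction-length bit (IL, bit 25) and the fault status code (the low six bits of the ISS). The function name and the lookup tables are my own and deliberately partial; the ARMv8 manual remains the authoritative reference.

```python
# Sketch: split an AArch64 ESR_ELx value into a few top-level fields.
# The tables are partial and only cover abort-related encodings.

EXCEPTION_CLASSES = {
    0x20: "Instruction Abort from a lower Exception level",
    0x21: "Instruction Abort taken without a change in Exception level",
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

# For fault status codes below 0b010000, bits [5:2] give the fault
# type and bits [1:0] give the translation table level.
FAULT_TYPES = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Return the EC, IL and fault status fields of an ESR value."""
    ec = (esr >> 26) & 0x3F   # exception class
    il = (esr >> 25) & 0x1    # 1 = 32-bit instruction
    fsc = esr & 0x3F          # fault status code (meaningful for aborts)
    if fsc < 0x10:
        fault = "%s, level %d" % (FAULT_TYPES[fsc >> 2], fsc & 0x3)
    else:
        fault = "other (see the architecture manual)"
    return {
        "ec": ec,
        "ec_name": EXCEPTION_CLASSES.get(ec, "not in this partial table"),
        "il": il,
        "fsc": fsc,
        "fault": fault,
    }


if __name__ == "__main__":
    for key, value in decode_esr(0x8300000F).items():
        print(key, "=", value)
```

For the value in the dump above, 0x8300000f, this gives EC 0x20 (an instruction abort from user space) with a level 3 permission fault, which agrees with the "level 3 permission fault" text in dmesg and with the PC being equal to the faulting address 0x4ac010. In other words, the process appears to have jumped to a page it is not allowed to execute.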
My OS is Fedora 17. Recently, the kernel tainted warning "kernel bug at kernel/auditsc.c:1772!-abrt" has been occurring:
This problem should not be reported (it is likely a known problem). A kernel problem occurred, but your kernel has been tainted (flags:GD). Kernel maintainers are unable to diagnose tainted reports.
Then, I get the following:
# cat /proc/sys/kernel/tainted
128
# dmesg | grep -i taint
[ 8306.955523] Pid: 4511, comm: chrome Tainted: G D 3.9.10-100.fc17.i686.PAE #1 Dell Inc.
[ 8307.366310] Pid: 4571, comm: chrome Tainted: G D 3.9.10-100.fc17.i686.PAE #1 Dell Inc.
It seems that the value "128" is quite serious:
128 – The system has died.
What should I make of this warning? Since chrome appears in the "Tainted" lines, has anybody else run into this?
To (over) simplify, 'tainted' means that the kernel is in a state other than what it would be in if it were built fresh from the open source origin and used in a way that it had been intended. It is a way of flagging a kernel to warn people (e.g., developers) that there may be unknown reasons for it to be unreliable, and that debugging it may be difficult or impossible.
In this case, 'GD' means that all modules are licensed as GPL or compatible (i.e., not proprietary), and that a crash or BUG() occurred.
The reasons are listed below:
See: oops-tracing.txt
---------------------------------------------------------------------------
Tainted kernels:
Some oops reports contain the string 'Tainted: ' after the program
counter. This indicates that the kernel has been tainted by some
mechanism. The string is followed by a series of position-sensitive
characters, each representing a particular tainted value.
1: 'G' if all modules loaded have a GPL or compatible license, 'P' if
any proprietary module has been loaded. Modules without a
MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by
insmod as GPL compatible are assumed to be proprietary.
2: 'F' if any module was force loaded by "insmod -f", ' ' if all
modules were loaded normally.
3: 'S' if the oops occurred on an SMP kernel running on hardware that
hasn't been certified as safe to run multiprocessor.
Currently this occurs only on various Athlons that are not
SMP capable.
4: 'R' if a module was force unloaded by "rmmod -f", ' ' if all
modules were unloaded normally.
5: 'M' if any processor has reported a Machine Check Exception,
' ' if no Machine Check Exceptions have occurred.
6: 'B' if a page-release function has found a bad page reference or
some unexpected page flags.
7: 'U' if a user or user application specifically requested that the
Tainted flag be set, ' ' otherwise.
8: 'D' if the kernel has died recently, i.e. there was an OOPS or BUG.
9: 'A' if the ACPI table has been overridden.
10: 'W' if a warning has previously been issued by the kernel.
(Though some warnings may set more specific taint flags.)
11: 'C' if a staging driver has been loaded.
12: 'I' if the kernel is working around a severe bug in the platform
firmware (BIOS or similar).
13: 'O' if an externally-built ("out-of-tree") module has been loaded.
14: 'E' if an unsigned module has been loaded in a kernel supporting
module signature.
15: 'L' if a soft lockup has previously occurred on the system.
16: 'K' if the kernel has been live patched.
The primary reason for the 'Tainted: ' string is to tell kernel
debuggers if this is a clean kernel or if anything unusual has
occurred. Tainting is permanent: even if an offending module is
unloaded, the tainted value remains to indicate that the kernel is not
trustworthy.
The numeric values for the content of the /proc/sys/kernel/tainted file are listed here as well:
Non-zero if the kernel has been tainted. Numeric values, which can be
ORed together. The letters are seen in "Tainted" line of Oops reports.
1 (P): A module with a non-GPL license has been loaded, this
includes modules with no license.
Set by modutils >= 2.4.9 and module-init-tools.
2 (F): A module was force loaded by insmod -f.
Set by modutils >= 2.4.9 and module-init-tools.
4 (S): Unsafe SMP processors: SMP with CPUs not designed for SMP.
8 (R): A module was forcibly unloaded from the system by rmmod -f.
16 (M): A hardware machine check error occurred on the system.
32 (B): A bad page was discovered on the system.
64 (U): The user has asked that the system be marked "tainted". This
could be because they are running software that directly modifies
the hardware, or for other reasons.
128 (D): The system has died.
256 (A): The ACPI DSDT has been overridden with one supplied by the user
instead of using the one provided by the hardware.
512 (W): A kernel warning has occurred.
1024 (C): A module from drivers/staging was loaded.
2048 (I): The system is working around a severe firmware bug.
4096 (O): An out-of-tree module has been loaded.
8192 (E): An unsigned module has been loaded in a kernel supporting module
signature.
16384 (L): A soft lockup has previously occurred on the system.
32768 (K): The kernel has been live patched.
65536 (X): Auxiliary taint, defined for and used by distros.
131072 (T): The kernel was built with the struct randomization plugin.
Source: https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
Credit: https://askubuntu.com/questions/248470/what-does-the-kernel-taint-value-mean
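Since the value in /proc/sys/kernel/tainted is a bitmask, it can be decoded mechanically. The following is a small Python sketch built from the table quoted above; the function name is my own, and the table only covers the bits listed in this answer, so newer kernels may define more.

```python
# Sketch: decode /proc/sys/kernel/tainted using the bit table quoted above.

TAINT_FLAGS = [
    (1, "P", "non-GPL (or unlicensed) module loaded"),
    (2, "F", "module force loaded by insmod -f"),
    (4, "S", "unsafe SMP processors"),
    (8, "R", "module force unloaded by rmmod -f"),
    (16, "M", "machine check error occurred"),
    (32, "B", "bad page discovered"),
    (64, "U", "taint requested by user"),
    (128, "D", "system has died (OOPS or BUG)"),
    (256, "A", "ACPI DSDT overridden"),
    (512, "W", "kernel warning occurred"),
    (1024, "C", "staging driver loaded"),
    (2048, "I", "working around severe firmware bug"),
    (4096, "O", "out-of-tree module loaded"),
    (8192, "E", "unsigned module loaded"),
    (16384, "L", "soft lockup occurred"),
    (32768, "K", "kernel live patched"),
    (65536, "X", "auxiliary taint used by distros"),
    (131072, "T", "built with struct randomization plugin"),
]


def decode_taint(value):
    """Return (letter, description) pairs for every bit set in value."""
    return [(letter, text) for bit, letter, text in TAINT_FLAGS if value & bit]


if __name__ == "__main__":
    # 128 is the value reported in the question.
    for letter, text in decode_taint(128):
        print(letter, "-", text)
```

For the value 128 from the question, only the 'D' bit is set. The 'G' in "Tainted: G D" is not a bit of its own: it is simply what is printed when the 'P' bit is clear.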
I am using a customized Armada 370 board based on ARMv7.
I am able to load U-Boot successfully. But when I load the Linux kernel directly into DRAM with the "loadb" command, I get the error below.
Error:-
########################################
[ 0.400000] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
[ 0.400000] Internal error: : 1406 [#1] PREEMPT
[ 0.400000] last sysfs file:
[ 0.400000] Modules linked in:
[ 0.400000] CPU: 0 Not tainted (2.6.34.10-WR4.3.0.0_standard #73)
[ 0.400000] PC is at trace_hardirqs_on+0x0/0x10
[ 0.400000] LR is at kernel_thread_helper+0x4/0x14
########################################
Below is the configuration the board is running with.
CPU freq - 1000MHz
DDR & L2 cache freq - 667MHz
I am using DDR3 SDRAM
I am using the Linux kernel 2.6.34 Marvell Armada 370 package from Wind River Linux.
I tried booting the same kernel image on the Marvell reference board, and it works fine.
I read in some article that these errors are related to RAM.
But in U-Boot, I am able to do successful read and write operations.
I analysed the log and found that the value 0x1406 is the content of the Data Fault Status Register.
Following this article, I decoded the value, and the error points to an AXI slave read error.
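For reference, the decoding can be sketched in a few lines of Python. This assumes the ARMv7 short-descriptor DFSR layout (FS[3:0] in bits 3:0, FS[4] in bit 10, Domain in bits 7:4, WnR in bit 11, ExT in bit 12); the function name and the partial status table are only illustrative, and the exact meaning of the ExT bit is implementation defined, so the CPU documentation should be checked for it.

```python
# Sketch: decode an ARMv7 DFSR value (short-descriptor format assumed).
# The status table is partial.

FAULT_STATUS = {
    0b00001: "alignment fault",
    0b00101: "translation fault (section)",
    0b00111: "translation fault (page)",
    0b01000: "synchronous external abort",
    0b01101: "permission fault (section)",
    0b01111: "permission fault (page)",
    0b10110: "asynchronous (imprecise) external abort",
}


def decode_dfsr(dfsr):
    """Split a DFSR value into fault status, domain, WnR and ExT."""
    # FS is a 5-bit field split across bit 10 and bits [3:0].
    fs = (((dfsr >> 10) & 0x1) << 4) | (dfsr & 0xF)
    return {
        "fs": fs,
        "fault": FAULT_STATUS.get(fs, "not in this partial table"),
        "domain": (dfsr >> 4) & 0xF,
        "write": bool((dfsr >> 11) & 0x1),  # WnR: False means a read
        "ext": (dfsr >> 12) & 0x1,          # implementation defined
    }


if __name__ == "__main__":
    for key, value in decode_dfsr(0x1406).items():
        print(key, "=", value)
```

For 0x1406 this yields FS = 0b10110, a read access, and ExT = 1, which matches the "imprecise external abort" text printed by the kernel. Because the abort is imprecise, the reported PC and the address 0x00000000 do not necessarily identify the access that caused it.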
Can you help me understand why I am getting this error?
Thanks in advance.
Thanks & Regards
Shamshad
The gist of the problem is: what are the ways a user-land app can get corrupted while it is running, other than hardware failures?
Hardware rig : ARM9 (at91sam9xe)
NAND Flash for :Linux kernel + FS + userland app.
We had an app running on embedded Linux on ARM9 (at91sam9xe). There were no problems for a couple of months, but then suddenly one ARM board was reported to be unable to execute the app.
When the app was executed, it crashed with the following dump:
pgd = c16b8000
[00000020] *pgd=215a0031, *pte=00000000, *ppte=00000000
Pid: 349, comm: console
CPU: 0 Not tainted (2.6.30.4-uc0 #280)
PC is at 0x4e000
LR is at 0x673e0
pc : [<0004e000>] lr : [<000673e0>] psr: 60000010
sp : bec6a728 ip : bec6acb4 fp : bec6ac9c
r10: 000bd9f8 r9 : 00000000 r8 : 00000000
r7 : 00000000 r6 : bec6acb4 r5 : 00000000 r4 : fbad2084
r3 : ffffffff r2 : bec6acb4 r1 : 00000025 r0 : 0009eab0
Flags: nZCv IRQs on FIQs on Mode USER_32 ISA ARM Segment user
Control: 0005317f Table: 216b8000 DAC: 00000015
[<c02ec3b0>] (show_regs+0x0/0x50) from [<c02f11a8>] (__do_user_fault+0x9c/0xa8)
r5:0000000b r4:c1696360
[<c02f110c>] (__do_user_fault+0x0/0xa8) from [<c02f1344>] (do_page_fault+0x114/0x244)
r7:00010000 r6:c1696360 r5:c15a62e0 r4:c1c5fde0
[<c02f1230>] (do_page_fault+0x0/0x244) from [<c02ea284>] (do_DataAbort+0x3c/0xa0)
[<c02ea248>] (do_DataAbort+0x0/0xa0) from [<c02eae00>] (ret_from_exception+0x0/0x10)
Exception stack(0xc1683fb0 to 0xc1683ff8)
3fa0: 0009eab0 00000025 bec6acb4 ffffffff
3fc0: fbad2084 00000000 bec6acb4 00000000 00000000 00000000 000bd9f8 bec6ac9c
3fe0: bec6acb4 bec6a728 000673e0 0004e000 60000010 ffffffff
I tried addr2line to see where it crashed, but it pointed to crtstuff.c. crtstuff.c is not part of our app; it is related to GCC, I think.
I feared my executable was corrupted, so I ran a diff between the file on NAND and the file from my PC. There were differences, which shouldn't happen. Moreover, almost all of the differing bytes were 0x00 instead of the value they should contain.
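A comparison like this can be quantified with a short script. The Python sketch below is not what I ran at the time; it only illustrates the idea: it compares a known-good image with the suspect copy, counts how many differing bytes are 0x00 in the suspect copy, and groups the differences into contiguous runs so they can be matched against flash page boundaries. All names are hypothetical.

```python
# Sketch: summarise byte-level differences between a known-good binary
# and a suspect copy read back from flash.

def diff_summary(good, suspect):
    """Compare two byte strings over their common length."""
    common = min(len(good), len(suspect))
    diffs = [i for i in range(common) if good[i] != suspect[i]]

    # Group consecutive differing offsets into (first, last) runs.
    runs = []
    for offset in diffs:
        if runs and offset == runs[-1][1] + 1:
            runs[-1][1] = offset
        else:
            runs.append([offset, offset])

    return {
        "differing": len(diffs),
        "zeroed": sum(1 for i in diffs if suspect[i] == 0),
        "runs": [(first, last) for first, last in runs],
        "length_mismatch": len(good) != len(suspect),
    }


if __name__ == "__main__":
    # Tiny synthetic example: four bytes zeroed, one byte changed.
    good = bytes(range(1, 17))
    suspect = bytearray(good)
    suspect[4:8] = b"\x00" * 4
    suspect[10] = 0xFF
    print(diff_summary(good, bytes(suspect)))
```

With real files, the two arguments would come from something like pathlib.Path(...).read_bytes() on each copy. If the runs line up with the flash page or block size, that points toward interrupted writes rather than random bit errors.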
What I really want to know is: how can a userland app get corrupted other than through hardware failures?
Cause:
NAND flash was always writeable, so we hypothesized that power had gone out at a moment when something was being written to flash.
Solution
We moved our FS to RAM and mount part of the NAND partition as writeable only when there is a need to write something. NAND write protect is controlled via a hardware pin, which is enabled only when there is a write request from the app.
I'm trying to start kernel debugging with this system:
Amontec JTAGkey2, openocd, gdb, eclipse.
In the end I would like to debug the kernel and the application running on it.
I have a few problems, and it seems that I need to solve them one after another.
What works now: CPU suspend/resume, read/write RAM.
What is missing: Step into, Step over, C/C++ level debugging.
I do the following:
- Connect JTAG, Power up board, start uImage with Debug messages via Uboot
- start openocd:
# openocd -f /usr/share/openocd/scripts/interface/jtagkey2.cfg -f /usr/share/openocd/scripts/board/at91sam9g20-ek.cfg
Output:
jtag_nsrst_delay: 200
jtag_ntrst_delay: 200
RCLK - adaptive
TapName | Enabled | IdCode Expected IrLen IrCap IrMask Instr
---|--------------------|---------|------------|------------|------|------|------|---------
0 | at91sam9g20.cpu | Y | 0x00000000 | 0x0792603f | 0x04 | 0x01 | 0x0f | 0x0f
Info : max TCK change to: 30000 kHz
Info : RCLK (adaptive clock speed)
Info : JTAG tap: at91sam9g20.cpu tap/device found: 0x0792603f (mfg: 0x01f, part: 0x7926, ver: 0x0)
Info : Embedded ICE version 6
And the problems start here:
openocd:
Warn : acknowledgment received, but no packet pending
undefined debug reason 6 - target needs reset
Warn : target not halted
eclipse:
symbol-file /opt/Tixi_Repos/KiwiG6v2/buildroot-2011.05/package_tixi/linux-2.6.39/arch/arm/boot/compressed/vmlinux
target remote localhost:3333
start () at arch/arm/boot/compressed/head.S:108
108 kphex r5, 8 /* end of kernel */
It also seems that JTAG is trying to load the code at 0x0, which I suppose is incorrect:
Update 1:
After analyzing some online tutorials for ARM:
The Eclipse Reset and Halt commands don't work perfectly. It is better to uncheck them and type the commands into the command window. The load address can also be added there:
monitor halt
load arch/arm/boot/compressed/vmlinux 0x22000000
I don't use
monitor reset
I let U-Boot start and initialize RAM and the other peripherals. Then I stop U-Boot by dropping into its shell. Then I let Eclipse write Linux into RAM and start it. It takes very long, but works a bit better. The kernel starts and stops at RPC initialization without giving the console back.
Would it be possible to load the kernel into RAM from the U-Boot console, and start the JTAG session afterwards?
What is the difference between the [load ...] and [monitor load...] commands?
Why do I need to load /compressed/vmlinux instead of uImage?
In the Eclipse window I have two load fields: load image and load symbols. I disable both options and only write load arch/arm/boot/compressed/vmlinux 0x22000000. Could that be the reason for the next problems?
Update 2:
Ok. Thank you for hints.
I've made some progress. Could you give me some advice? Maybe I'm still doing something wrong.
Now my kernel runs under JTAG control, but I still can't debug on source code level.
I do as follows:
Power up the board, go into uboot shell.
start openOCD session
Set Uboot breakpoint in bootm.c on theKernel call:
cleanup_before_linux ();
theKernel (0, machid, bd->bi_boot_params);
start eclipse debug session :
monitor halt
load uboot-a without offset
load u-boot-2010.06/u-boot
Loading section .text, size 0x349ec lma 0x26f00000
start uboot and let it run
U-Boot stops on the "theKernel" call
I know that kernel is located on address 0x20008000.
restart openOCD session
start eclipse debugger once more with kernel configuration:
monitor halt
load kernel on address 0x20008000
load arch/arm/boot/compressed/vmlinux 0x20008000
Loading section .text, size 0x8bdc7c lma 0x20008000
start debug session
Everything works fine now, and kernel starts, but I still can't debug on source code level.
"symbol is not available"
DEBUG and DEBUG_INFO are on for kernel.
vmlinux screenshot
What seems strange to me is that there are around 50 function symbols in this file.
First let me say I am a total WinDbg noob, so this might be an easy question...
I have an application ("MyApp" - name changed to protect the innocent!) that I am trying to debug because it is throwing an exception. This only happens on user machines; I have not been able to reproduce it on my development machine. So I set up DebugDiag on the user's machine and captured a Full Dump. Then I loaded the dump in WinDbg and ran !analyze -v and kp to try to figure out what was going on, but neither of these seems to give me the information I'm looking for: the function (and hopefully the line number) that is causing the problem. I think I have the symbol file loaded by specifying the path to 'MyApp.pdb' in the Symbol File Path:
srv*c:\symcache*http://msdl.microsoft.com/download/symbols;srv*c:\symcache*C:\dev\Customer\MyAppSln\MyApp\Debug
First, here's the output from kp:
0:004> kp
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
0502f474 7c347966 MyApp!DllMain+0x3e8a6
0502f4bc 7c3a2448 msvcr71!_nh_malloc(unsigned int size = <Memory access error>, int nhFlag = <Memory access error>)+0x24 [f:\vs70builds\3052\vc\crtbld\crt\src\malloc.c # 117]
0502f57c 7c3416b3 msvcp71!std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >::_Tidy(bool _Built = <Memory access error>, unsigned int _Newsize = <Memory access error>)+0x45 [f:\vs70builds\3077\vc\crtbld\crt\src\xstring # 1520]
0502f610 7c3a32de msvcr71!_heap_alloc(unsigned int size = <Memory access error>)+0xe0 [f:\vs70builds\3052\vc\crtbld\crt\src\malloc.c # 212]
0502f620 7c3b3f63 msvcp71!wmemcpy(wchar_t * _S1 = 0x04e463b9 "Ҹ???", wchar_t * _S2 = 0xffffffff "--- memory read error at address 0xffffffff ---", unsigned int _N = 0x4e25212)+0x14 [f:\vs70builds\3077\vc\crtbld\crt\src\wchar.h # 843]
0502f640 04e463b9 msvcp71!std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >::assign(class std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > * _Right = 0xffffffff, unsigned int _Roff = 0x4e25212, unsigned int _Count = 2)+0x7c [f:\vs70builds\3077\vc\crtbld\crt\src\xstring # 601]
0502f770 04df1077 MyApp!DllMain+0x65329
0502f824 04e01b35 MyApp!DllMain+0xffe7
0502ff08 04dfe034 MyApp!DllMain+0x20aa5
0502ff48 04dfde4f MyApp!DllMain+0x1cfa4
0502ff88 7648d0e9 MyApp!DllMain+0x1cdbf
0502ffc4 773499f9 kernel32!BaseThreadInitThunk+0xe
0502ffd4 7738198e ntdll!RtlQueryInformationAcl+0x8b
0502ffec 00000000 ntdll!_RtlUserThreadStart+0x1b
The line I'm specifically trying to decode is 'MyApp!DllMain+0x65329', as this is the last line of my code that seems to be executing, and the error is occurring within the malloc call, which is apparently where the exception is being thrown from. What am I doing wrong that makes it display only the module and offset instead of the source file and line number?
I'm also not sure why the line above the malloc call is back in MyApp again. Maybe someone can explain that too.
Just in case, here's the output from '!analyze -v':
0:004> !analyze -v
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
*** WARNING: Unable to verify checksum for MyApp.exe
*** ERROR: Module load completed but symbols could not be loaded for MyApp.exe
*** WARNING: Unable to verify checksum for ThirdPartyDll.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for ThirdPartyDll.dll -
*** WARNING: Unable to verify checksum for mdnsNSP.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for mdnsNSP.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for SLC.dll -
FAULTING_IP:
MyApp!DllMain+3e8a6
04e1f936 8b16 mov edx,dword ptr [esi]
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 04e1f936 (MyApp!DllMain+0x0003e8a6)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 00000000
Attempt to read from address 00000000
PROCESS_NAME: MyApp.exe
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 00000000
READ_ADDRESS: 00000000
FOLLOWUP_IP:
msvcr71!_heap_alloc+e0 [f:\vs70builds\3052\vc\crtbld\crt\src\malloc.c # 212]
7c3416b3 e88e0c0000 call msvcr71!__SEH_epilog (7c342346)
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
LAST_CONTROL_TRANSFER: from 00000000 to 773bbb33
FAULTING_THREAD: ffffffff
BUGCHECK_STR: APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_NULL_POINTER_READ_SHUTDOWN
PRIMARY_PROBLEM_CLASS: ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_SHUTDOWN
DEFAULT_BUCKET_ID: ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_SHUTDOWN
STACK_TEXT:
773bbb33 ntdll!RtlpAllocateHeap+0x7ad
773a6e0c ntdll!RtlAllocateHeap+0x1e3
7c3416b3 msvcr71!_heap_alloc+0xe0
FAULTING_SOURCE_CODE:
No source found for 'f:\vs70builds\3052\vc\crtbld\crt\src\malloc.c'
SYMBOL_STACK_INDEX: 2
SYMBOL_NAME: msvcr71!_heap_alloc+e0
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: msvcr71
IMAGE_NAME: msvcr71.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 3e561eac
STACK_COMMAND: dds 7740c078 ; kb
FAILURE_BUCKET_ID: ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_SHUTDOWN_c0000005_msvcr71.dll!_heap_alloc
BUCKET_ID: APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_freelists_corruption_NULL_POINTER_READ_SHUTDOWN_msvcr71!_heap_alloc+e0
If you believe the PDB should be in your symbol path, you should run something like this:
!sym noisy
.reload MyApp.dll
kp
!sym noisy causes the debugger to give more detailed information on why it couldn't load symbols: no MyApp.pdb found, found but does not match, etc. This will help you find out why it is not loading symbols. !sym quiet turns the verbose symbol output off again.
When you set the path for symbols, did you reload them?
.reload
I'm not sure your adding
srv*c:\symcache*C:\dev\Customer\MyAppSln\MyApp\Debug
to the symbol path has the desired effect.
I usually list all local paths in the .sympath first, and as the last step, I do .symfix+ to configure the public symbols using the Microsoft symbol server:
.sympath C:\dev\Customer\MyAppSln\MyApp\Debug
.symfix+ c:\symcache
The rationale behind listing local paths first is that the debugger can retrieve the PDBs locally instead of first checking the remote server for files that are not there anyway.
In any case, your problem is that the symbols for MyApp are not loaded, so stack walking does not quite work.
The debugger walks the stack backwards, starting from the top, which is why you're seeing MyApp there - this is where the access violation occurred.
Since the debugger does not have the symbols at this point, it can only guess what invocation chain led to the function on top.
And it guesses wrong by following a misleading path.