Windows Service crashes - windows

I have a windows service running on Windows Server 2003. The service crashes ocassionlly. I tried analysis error through DebugDiag and got following information. Can anyone help me to understand the problem. The service uses GSM libaray for SMS service.
Error:
In EnfieldSMSService__PID__1764__Date__01_21_2011__Time_09_34_31AM__554__Second_Chance_Exception_C0000005.
dmp the assembly instruction at 0x00bb29d1 which does not correspond to any
known native module in the process has caused an access violation exception
(0xC0000005) when trying to read from memory location 0x00000008 on thread 10
Report for
>EnfieldSMSService__PID__1764__Date__01_21_2011__Time_09_34_31AM__554__Second_Chance_Exception_C0000005.
dmp
Type of Analysis Performed Crash Analysis
Machine Name REALTRACSERVER
Operating System Windows Server 2003 Service Pack 2
Number Of Processors 4
Process ID 1764
Process Image C:\Program Files\Default Company Name\EnfieldSMSSetUp\
EnfieldSMSService.exe
System Up-Time 7 day(s) 18:30:55
Process Up-Time 00:05:24
Thread 10 - System ID 3632
Entry point mscorwks!CreateApplicationContext+bbef
Create time 1/21/2011 9:34:31 AM
Time spent in user mode 0 Days 0:0:0.0
Time spent in kernel mode 0 Days 0:0:0.0
Function Arg 1 Arg 2 Arg 3 Source
0x00bb29d1 01041e54 00b8f8c4 792d6cf6
0x00bb11c7 01041ee0 00b8f8d8 792e019f
mscorlib_ni+216cf6 00b8f91c 01041ee0 01041eac
mscorlib_ni+22019f 01041eac 00000000 001df020
mscorlib_ni+216c74 00000217 7c82b02a 00b8f980
mscorwks+1b4c 00b8f9d0 00000000 00b8f9a0
mscorwks!DllUnregisterServerInternal+6195 00b8f9d0 00000000
00b8f9a0
mscorwks!CoUninitializeEE+2e95 7924290c 00b8fc14 00b8fb4c
mscorwks!CoUninitializeEE+2ec8 7924290c 00b8fc14 00b8fb4c
mscorwks!CoUninitializeEE+2ee6 00b8fb4c 95f5cb6e 001df020
mscorwks!CorExitProcess+1e4d 00b8fe50 00000000 00000000
mscorwks!CoUninitializeEE+4df3 00b8fdc4 00b8fd70 79f7759b
mscorwks!CoUninitializeEE+4d8f 00b8fdc4 95f5ca02 00000000
mscorwks!CoUninitializeEE+4cb5 00b8fdc4 00000001 00000000
mscorwks!CoUninitializeEE+4e41 00000001 79f3d6e9 00b8fe50
mscorwks!CorExitProcess+1c1e 00000001 79f3d6e9 00b8fe50
mscorwks!CorExitProcess+1cf8 001e2f68 00000001 00000001
mscorwks!CreateApplicationContext+bc35 0019cf88 00000000 00000000
kernel32!GetModuleHandleA+df 79f91fcf 0019cf88 00000000
Detailed stack corruption analysis for thread 10
Call stack with StackWalk
Index Return Address
1 0x00bb29d1
2 0x00bb11c7
3 mscorlib_ni+216cf6
4 mscorlib_ni+22019f
5 mscorlib_ni+216c74
6 mscorwks+1b4c
7 mscorwks!DllUnregisterServerInternal+6195
8 mscorwks!CoUninitializeEE+2e95
9 mscorwks!CoUninitializeEE+2ec8
10 mscorwks!CoUninitializeEE+2ee6
11 mscorwks!CorExitProcess+1e4d
12 mscorwks!CoUninitializeEE+4df3
13 mscorwks!CoUninitializeEE+4d8f
14 mscorwks!CoUninitializeEE+4cb5
15 mscorwks!CoUninitializeEE+4e41
16 mscorwks!CorExitProcess+1c1e
17 mscorwks!CorExitProcess+1cf8
18 mscorwks!CreateApplicationContext+bc35
19 kernel32!GetModuleHandleA+df
Call stack - Heuristic
Index Stack Address Child EBP Return Address Destination
1 0x00000000 0x00b8f8ac 0x00bb29d1 0x00000000
2 0x00b8f8b0 0x00b8f8b8 0x00bb11c7 0x00bb2940
3 0x00b8f8bc 0x00b8f8c4 mscorlib_ni+216cf6 0x00000000
4 0x00b8f8c8 0x00b8f8d8 mscorlib_ni+22019f 0x00000000
5 0x00b8f8dc 0x00b8f8f0 mscorlib_ni+216c74 mscorlib_ni+220130
6 0x00b8f8f4 0x00b8f900 mscorwks+1b4c 0x00000000
7 0x00b8f904 0x00b8f980 mscorwks!DllUnregisterServerInternal+6195
mscorwks+1b19
8 0x00b8f984 0x00b8fab8 mscorwks!CoUninitializeEE+2e95 mscorwks!
DllUnregisterServerInternal+60f6
9 0x00b8fabc 0x00b8fad4 mscorwks!CoUninitializeEE+2ec8 mscorwks!
CoUninitializeEE+2d3b
10 0x00b8fad8 0x00b8faec mscorwks!CoUninitializeEE+2ee6 mscorwks!
CoUninitializeEE+2ea9
11 0x00b8faf0 0x00b8fcd4 mscorwks!CorExitProcess+1e4d mscorwks!
CoUninitializeEE+2ecc
12 0x00b8fcd8 0x00b8fce8 mscorwks!CoUninitializeEE+4df3 0x00000000
13 0x00b8fcec 0x00b8fd7c mscorwks!CoUninitializeEE+4d8f mscorwks!
CoUninitializeEE+4dc8
14 0x00b8fd80 0x00b8fdb8 mscorwks!CoUninitializeEE+4cb5 mscorwks!
CoUninitializeEE+4ce0
15 0x00b8fdbc 0x00b8fde0 mscorwks!CoUninitializeEE+4e41 mscorwks!
CoUninitializeEE+4c90
16 0x00b8fde4 0x00b8fdf8 mscorwks!CorExitProcess+1c1e mscorwks!
CoUninitializeEE+4e1c
17 0x00b8fdfc 0x00b8fe94 mscorwks!CorExitProcess+1cf8 mscorwks!
CorExitProcess+1c0b
18 0x00b8fe98 0x00b8ffb8 mscorwks!CreateApplicationContext+bc35 0x00000000
19 0x00b8ffbc 0x00b8ffec kernel32!GetModuleHandleA+df 0x00000000
792d6cf4 ffd0 call eax
792e019d ffd0 call eax
792d6c6f e8bc940000 call mscorlib_ni+0x220130 (792e0130)
79e71b49 ff5518 call dword ptr [ebp+18h]
79e821ac e868f9feff call mscorwks+0x1b19 (79e71b19)
79e964fc e811bcfeff call mscorwks!DllUnregisterServerInternal+0x60f6
(79e82112)
79e9652f e873feffff call mscorwks!CoUninitializeEE+0x2d3b (79e963a7)
79e9654d e8c3ffffff call mscorwks!CoUninitializeEE+0x2ea9 (79e96515)
79f3d7fe e8358df5ff call mscorwks!CoUninitializeEE+0x2ecc (79e96538)
79e9845c ff560c call dword ptr [esi+0Ch]
79e983f6 e839000000 call mscorwks!CoUninitializeEE+0x4dc8 (79e98434)
79e9831c e82b000000 call mscorwks!CoUninitializeEE+0x4ce0 (79e9834c)
79e984a8 e84ffeffff call mscorwks!CoUninitializeEE+0x4c90 (79e982fc)
79f3d5cf e8b4aef5ff call mscorwks!CoUninitializeEE+0x4e1c (79e98488)
79f3d6a9 e813ffffff call mscorwks!CorExitProcess+0x1c0b (79f3d5c1)
79f92013 ffd6 call esi
77e64826 ff5508 call dword ptr [ebp+8]
In EnfieldSMSService__PID__1764__Date__01_21_2011__Time_09_34_31AM__554__Second_Chance_Exception_C0000005.
dmp the assembly instruction at 0x00bb29d1 which does not correspond to any
known native module in the process has caused an access violation exception
(0xC0000005) when trying to read from memory location 0x00000008 on thread 10

Related

DPC_WATCHDOG_VIOLATION (133/1) Potentially related to NdisFIndicateReceiveNetBufferLists?

We have a NDIS LWF driver, and on a single machine we get a DPC_WATCHDOG_VIOLATION 133/1 bugcheck when they try to connect to their VPN to connect to the internet. This could be related to our NdisFIndicateReceiveNetBufferLists, as the IRQL is raised to DISPATCH before calling it (and obviously lowered to whatever it was afterward), and that does appear in the output of !dpcwatchdog shown below. This is done due to a workaround for another bug explained here:
IRQL_UNEXPECTED_VALUE BSOD after NdisFIndicateReceiveNetBufferLists?
Now this is the bugcheck:
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
DISPATCH_LEVEL or above. The offending component can usually be
identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: fffff805422fb320, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
additional information regarding the cumulative timeout
Arg4: 0000000000000000
STACK_TEXT:
nt!KeBugCheckEx
nt!KeAccumulateTicks+0x1846b2
nt!KiUpdateRunTime+0x5d
nt!KiUpdateTime+0x4a1
nt!KeClockInterruptNotify+0x2e3
nt!HalpTimerClockInterrupt+0xe2
nt!KiCallInterruptServiceRoutine+0xa5
nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
nt!KiInterruptDispatchNoLockNoEtw+0x37
nt!KxWaitForSpinLockAndAcquire+0x2c
nt!KeAcquireSpinLockAtDpcLevel+0x5c
wanarp!WanNdisReceivePackets+0x4bb
ndis!ndisMIndicateNetBufferListsToOpen+0x141
ndis!ndisMTopReceiveNetBufferLists+0x3f0e4
ndis!ndisCallReceiveHandler+0x61
ndis!ndisInvokeNextReceiveHandler+0x1df
ndis!NdisMIndicateReceiveNetBufferLists+0x104
ndiswan!IndicateRecvPacket+0x596
ndiswan!ApplyQoSAndIndicateRecvPacket+0x20b
ndiswan!ProcessPPPFrame+0x16f
ndiswan!ReceivePPP+0xb3
ndiswan!ProtoCoReceiveNetBufferListChain+0x442
ndis!ndisMCoIndicateReceiveNetBufferListsToNetBufferLists+0xf6
ndis!NdisMCoIndicateReceiveNetBufferLists+0x11
raspptp!CallIndicateReceived+0x210
raspptp!CallProcessRxNBLs+0x199
ndis!ndisDispatchIoWorkItem+0x12
nt!IopProcessWorkItem+0x135
nt!ExpWorkerThread+0x105
nt!PspSystemThreadStartup+0x55
nt!KiStartSystemThread+0x28
SYMBOL_NAME: wanarp!WanNdisReceivePackets+4bb
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: wanarp
IMAGE_NAME: wanarp.sys
And this following is the output of !dpcwatchdog, but I still can't find what is causing this bugcheck, and can't find which function is consuming too much time in DISPATCH level which is causing this bugcheck. Although I think this could be related to some spin locking done by wanarp? Could this be a bug with wanarp? Note that we don't use any spinlocking in our driver, and us raising the IRQL should not cause any issue as it is actually very common for indication in Ndis to be done at IRQL DISPATCH.
So How can I find the root cause of this bugcheck? There are no other third party LWF in the ndis stack.
3: kd> !dpcwatchdog
All durations are in seconds (1 System tick = 15.625000 milliseconds)
Circular Kernel Context Logger history: !logdump 0x2
DPC and ISR stats: !intstats /d
--------------------------------------------------
CPU#0
--------------------------------------------------
Current DPC: No Active DPC
Pending DPCs:
----------------------------------------
CPU Type KDPC Function
dpcs: no pending DPCs found
--------------------------------------------------
CPU#1
--------------------------------------------------
Current DPC: No Active DPC
Pending DPCs:
----------------------------------------
CPU Type KDPC Function
1: Normal : 0xfffff80542220e00 0xfffff805418dbf10 nt!PpmCheckPeriodicStart
1: Normal : 0xfffff80542231d40 0xfffff8054192c730 nt!KiBalanceSetManagerDeferredRoutine
1: Normal : 0xffffbd0146590868 0xfffff80541953200 nt!KiEntropyDpcRoutine
DPC Watchdog Captures Analysis for CPU #1.
DPC Watchdog capture size: 641 stacks.
Number of unique stacks: 1.
No common functions detected!
The captured stacks seem to indicate that only a single DPC or generic function is the culprit.
Try to analyse what other processors were doing at the time of the following reference capture:
CPU #1 DPC Watchdog Reference Stack (#0 of 641) - Time: 16 Min 17 Sec 984.38 mSec
# RetAddr Call Site
00 fffff805418d8991 nt!KiUpdateRunTime+0x5D
01 fffff805418d2803 nt!KiUpdateTime+0x4A1
02 fffff805418db1c2 nt!KeClockInterruptNotify+0x2E3
03 fffff80541808a45 nt!HalpTimerClockInterrupt+0xE2
04 fffff805419fab9a nt!KiCallInterruptServiceRoutine+0xA5
05 fffff805419fb107 nt!KiInterruptSubDispatchNoLockNoEtw+0xFA
06 fffff805418a9a9c nt!KiInterruptDispatchNoLockNoEtw+0x37
07 fffff805418da3cc nt!KxWaitForSpinLockAndAcquire+0x2C
08 fffff8054fa614cb nt!KeAcquireSpinLockAtDpcLevel+0x5C
09 fffff80546ba1eb1 wanarp!WanNdisReceivePackets+0x4BB
0a fffff80546be0b84 ndis!ndisMIndicateNetBufferListsToOpen+0x141
0b fffff80546ba7ef1 ndis!ndisMTopReceiveNetBufferLists+0x3F0E4
0c fffff80546bddfef ndis!ndisCallReceiveHandler+0x61
0d fffff80546ba4a94 ndis!ndisInvokeNextReceiveHandler+0x1DF
0e fffff8057c32d17e ndis!NdisMIndicateReceiveNetBufferLists+0x104
0f fffff8057c30d6c7 ndiswan!IndicateRecvPacket+0x596
10 fffff8057c32d56b ndiswan!ApplyQoSAndIndicateRecvPacket+0x20B
11 fffff8057c32d823 ndiswan!ProcessPPPFrame+0x16F
12 fffff8057c308e62 ndiswan!ReceivePPP+0xB3
13 fffff80546c5c006 ndiswan!ProtoCoReceiveNetBufferListChain+0x442
14 fffff80546c5c2d1 ndis!ndisMCoIndicateReceiveNetBufferListsToNetBufferLists+0xF6
15 fffff8057c2b0064 ndis!NdisMCoIndicateReceiveNetBufferLists+0x11
16 fffff8057c2b06a9 raspptp!CallIndicateReceived+0x210
17 fffff80546bd9dc2 raspptp!CallProcessRxNBLs+0x199
18 fffff80541899645 ndis!ndisDispatchIoWorkItem+0x12
19 fffff80541852b65 nt!IopProcessWorkItem+0x135
1a fffff80541871d25 nt!ExpWorkerThread+0x105
1b fffff80541a00778 nt!PspSystemThreadStartup+0x55
1c ---------------- nt!KiStartSystemThread+0x28
--------------------------------------------------
CPU#2
--------------------------------------------------
Current DPC: No Active DPC
Pending DPCs:
----------------------------------------
CPU Type KDPC Function
2: Normal : 0xffffbd01467f0868 0xfffff80541953200 nt!KiEntropyDpcRoutine
DPC Watchdog Captures Analysis for CPU #2.
DPC Watchdog capture size: 641 stacks.
Number of unique stacks: 1.
No common functions detected!
The captured stacks seem to indicate that only a single DPC or generic function is the culprit.
Try to analyse what other processors were doing at the time of the following reference capture:
CPU #2 DPC Watchdog Reference Stack (#0 of 641) - Time: 16 Min 17 Sec 984.38 mSec
# RetAddr Call Site
00 fffff805418d245a nt!KeClockInterruptNotify+0x453
01 fffff80541808a45 nt!HalpTimerClockIpiRoutine+0x1A
02 fffff805419fab9a nt!KiCallInterruptServiceRoutine+0xA5
03 fffff805419fb107 nt!KiInterruptSubDispatchNoLockNoEtw+0xFA
04 fffff805418a9a9c nt!KiInterruptDispatchNoLockNoEtw+0x37
05 fffff805418a9a68 nt!KxWaitForSpinLockAndAcquire+0x2C
06 fffff8054fa611cb nt!KeAcquireSpinLockRaiseToDpc+0x88
07 fffff80546ba1eb1 wanarp!WanNdisReceivePackets+0x1BB
08 fffff80546be0b84 ndis!ndisMIndicateNetBufferListsToOpen+0x141
09 fffff80546ba7ef1 ndis!ndisMTopReceiveNetBufferLists+0x3F0E4
0a fffff80546bddfef ndis!ndisCallReceiveHandler+0x61
0b fffff80546be3a81 ndis!ndisInvokeNextReceiveHandler+0x1DF
0c fffff80546ba804e ndis!ndisFilterIndicateReceiveNetBufferLists+0x3C611
0d fffff8054e384d77 ndis!NdisFIndicateReceiveNetBufferLists+0x6E
0e fffff8054e3811a9 ourdriver+0x4D70
0f fffff80546ba7d40 ourdriver+0x11A0
10 fffff8054182a6b5 ndis!ndisDummyIrpHandler+0x100
11 fffff80541c164c8 nt!IofCallDriver+0x55
12 fffff80541c162c7 nt!IopSynchronousServiceTail+0x1A8
13 fffff80541c15646 nt!IopXxxControlFile+0xC67
14 fffff80541a0aab5 nt!NtDeviceIoControlFile+0x56
15 ---------------- nt!KiSystemServiceCopyEnd+0x25
--------------------------------------------------
CPU#3
--------------------------------------------------
Current DPC: No Active DPC
Pending DPCs:
----------------------------------------
CPU Type KDPC Function
dpcs: no pending DPCs found
Target machine version: Windows 10 Kernel Version 19041 MP (4 procs)
Also note that we also pass the NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL flag to the NdisFIndicateReceiveNetBufferLists, if the current IRQL is dispatch.
Edit1:
This is also the output of !locks and !qlocks and !ready, And the contention count on one of the resources is 49135, is this normal or too high? Could this be related to our issue? The threads that are waiting on it or own it are for normal processes such as chrome, csrss, etc.
3: kd> !kdexts.locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks.
Resource # nt!ExpTimeRefreshLock (0xfffff80542219440) Exclusively owned
Contention Count = 17
Threads: ffffcf8ce9dee640-01<*>
KD: Scanning for held locks.....
Resource # 0xffffcf8cde7f59f8 Shared 1 owning threads
Contention Count = 62
Threads: ffffcf8ce84ec080-01<*>
KD: Scanning for held locks...............................................................................................
Resource # 0xffffcf8ce08d0890 Exclusively owned
Contention Count = 49135
NumberOfSharedWaiters = 1
NumberOfExclusiveWaiters = 6
Threads: ffffcf8cf18e3080-01<*> ffffcf8ce3faf080-01
Threads Waiting On Exclusive Access:
ffffcf8ceb6ce080 ffffcf8ce1d20080 ffffcf8ce77f1080 ffffcf8ce92f4080
ffffcf8ce1d1f0c0 ffffcf8ced7c6080
KD: Scanning for held locks.
Resource # 0xffffcf8ce08d0990 Shared 1 owning threads
Threads: ffffcf8cf18e3080-01<*>
KD: Scanning for held locks.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Resource # 0xffffcf8ceff46350 Shared 1 owning threads
Threads: ffffcf8ce6de8080-01<*>
KD: Scanning for held locks......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Resource # 0xffffcf8cf0cade50 Exclusively owned
Contention Count = 3
Threads: ffffcf8ce84ec080-01<*>
KD: Scanning for held locks.........................
Resource # 0xffffcf8cf0f76180 Shared 1 owning threads
Threads: ffffcf8ce83dc080-02<*>
KD: Scanning for held locks.......................................................................................................................................................................................................................................................
Resource # 0xffffcf8cf1875cb0 Shared 1 owning threads
Contention Count = 3
Threads: ffffcf8ce89db040-02<*>
KD: Scanning for held locks.
Resource # 0xffffcf8cf18742d0 Shared 1 owning threads
Threads: ffffcf8cee5e1080-02<*>
KD: Scanning for held locks....................................................................................
Resource # 0xffffcf8cdceeece0 Shared 2 owning threads
Contention Count = 4
Threads: ffffcf8ce3a1c080-01<*> ffffcf8ce5625040-01<*>
Resource # 0xffffcf8cdceeed48 Shared 1 owning threads
Threads: ffffcf8ce5625043-02<*> *** Actual Thread ffffcf8ce5625040
KD: Scanning for held locks...
Resource # 0xffffcf8cf1d377d0 Exclusively owned
Threads: ffffcf8cf0ff3080-02<*>
KD: Scanning for held locks....
Resource # 0xffffcf8cf1807050 Exclusively owned
Threads: ffffcf8ce84ec080-01<*>
KD: Scanning for held locks......
245594 total locks, 13 locks currently held
3: kd> !qlocks
Key: O = Owner, 1-n = Wait order, blank = not owned/waiting, C = Corrupt
Processor Number
Lock Name 0 1 2 3
KE - Unused Spare
MM - Unused Spare
MM - Unused Spare
MM - Unused Spare
CC - Vacb
CC - Master
EX - NonPagedPool
IO - Cancel
CC - Unused Spare
IO - Vpb
IO - Database
IO - Completion
NTFS - Struct
AFD - WorkQueue
CC - Bcb
MM - NonPagedPool
3: kd> !ready
KSHARED_READY_QUEUE fffff8053f1ada00: (00) ****------------------------------------------------------------
SharedReadyQueue fffff8053f1ada00: No threads in READY state
Processor 0: No threads in READY state
Processor 1: Ready Threads at priority 15
THREAD ffffcf8ce9dee640 Cid 2054.2100 Teb: 000000fab7bca000 Win32Thread: 0000000000000000 READY on processor 1
Processor 2: No threads in READY state
Processor 3: No threads in READY state
3: kd> dt nt!_ERESOURCE 0xffffcf8ce08d0890
+0x000 SystemResourcesList : _LIST_ENTRY [ 0xffffcf8c`e08d0610 - 0xffffcf8c`e08cf710 ]
+0x010 OwnerTable : 0xffffcf8c`ee6e8210 _OWNER_ENTRY
+0x018 ActiveCount : 0n1
+0x01a Flag : 0xf86
+0x01a ReservedLowFlags : 0x86 ''
+0x01b WaiterPriority : 0xf ''
+0x020 SharedWaiters : 0xffffae09`adcae8e0 Void
+0x028 ExclusiveWaiters : 0xffffae09`a9aabea0 Void
+0x030 OwnerEntry : _OWNER_ENTRY
+0x040 ActiveEntries : 1
+0x044 ContentionCount : 0xbfef
+0x048 NumberOfSharedWaiters : 1
+0x04c NumberOfExclusiveWaiters : 6
+0x050 Reserved2 : (null)
+0x058 Address : (null)
+0x058 CreatorBackTraceIndex : 0
+0x060 SpinLock : 0
3: kd> dx -id 0,0,ffffcf8cdcc92040 -r1 (*((ntkrnlmp!_OWNER_ENTRY *)0xffffcf8ce08d08c0))
(*((ntkrnlmp!_OWNER_ENTRY *)0xffffcf8ce08d08c0)) [Type: _OWNER_ENTRY]
[+0x000] OwnerThread : 0xffffcf8cf18e3080 [Type: unsigned __int64]
[+0x008 ( 0: 0)] IoPriorityBoosted : 0x0 [Type: unsigned long]
[+0x008 ( 1: 1)] OwnerReferenced : 0x0 [Type: unsigned long]
[+0x008 ( 2: 2)] IoQoSPriorityBoosted : 0x1 [Type: unsigned long]
[+0x008 (31: 3)] OwnerCount : 0x1 [Type: unsigned long]
[+0x008] TableSize : 0xc [Type: unsigned long]
3: kd> dx -id 0,0,ffffcf8cdcc92040 -r1 ((ntkrnlmp!_OWNER_ENTRY *)0xffffcf8cee6e8210)
((ntkrnlmp!_OWNER_ENTRY *)0xffffcf8cee6e8210) : 0xffffcf8cee6e8210 [Type: _OWNER_ENTRY *]
[+0x000] OwnerThread : 0x0 [Type: unsigned __int64]
[+0x008 ( 0: 0)] IoPriorityBoosted : 0x1 [Type: unsigned long]
[+0x008 ( 1: 1)] OwnerReferenced : 0x1 [Type: unsigned long]
[+0x008 ( 2: 2)] IoQoSPriorityBoosted : 0x1 [Type: unsigned long]
[+0x008 (31: 3)] OwnerCount : 0x0 [Type: unsigned long]
[+0x008] TableSize : 0x7 [Type: unsigned long]
Thanks for reporting this. I've tracked this down to an OS bug: there's a deadlock in wanarp. This issue appears to affect every version of the OS going back to Windows Vista.
I've filed internal issue task.ms/42393356 to track this: if you have a Microsoft support contract, your rep can get you status updates on that issue.
Meanwhile, you can partially work around this issue by either:
Indicating 1 packet at a time (NumberOfNetBufferLists==1); or
Indicating on a single CPU at a time
The bug in wanarp is exposed when 2 or more CPUs collectively process 3 or more NBLs at the same time. So either workaround would avoid the trigger conditions.
Depending on how much bandwidth you're pushing through this network interface, those options could be rather bad for CPU/battery/throughput. So please try to avoid pessimizing batching unless it's really necessary. (For example, you could make this an option that's off-by-default, unless the customer specifically uses wanarp.)
Note that you cannot fully prevent the issue yourself. Other drivers in the stack, including NDIS itself, have the right to group packets together, which would have the side effect re-batching the packets that you carefully un-batched. However, I believe that you can make a statistically significant dent in the crashes if you just indicate 1 NBL at a time, or indicate multiple NBLs on 1 CPU at a time.
Sorry this is happening to you again! wanarp is... a very old codebase.

How should I apply add-symbol-file command during u-boot linux boot debug?

I'm following linux bootloading using u-boot (using SPL falcon mode where u-boot-spl launches linux directly) on a qemu virtual machine. Now the code jumped to linux kernel and because I have done add-symbol-file vmlinux 0x80081000 I can follow the kernel code step by step using gdb connected to the virtual machine. Actually I loaded the kernel image to 0x80080000 but I had to set the address to 0x80081000 to make the source code appear on the gdb correctly according to the PC value(I don't know why this difference of 0x1000 is needed).
Later I found the kernel sets up the page table (identity mapping and swap table) and jumps to __primary_switched and this is where pure kernel virtual address is used first time for the PC. This is where the call is made at the end of the head.S file.
ldr x8, =__primary_switched
adrp x0, __PHYS_OFFSET
br x8
In the symbol file (vmlinux, an elf file), the symbols before __primary_switched are all mapped at virtual addresses (starting with 0xffffffc0..... high addresses) but the gdb could follow the source even when the PC value was using physical address. (The PC was initially loaded with physical address of the kernel start and PC relative jumps were being used until it jumps to __primary_switched, mmu disabled or using identity mapping) So does this mean, in doing add-symbol-file only the offset of the symbols from the start of text matters?
Another quetion : I can follow the kernel source with gdb but after __primary_switched, I cannot see the source. The debugger doesn't show the correct source location according to the now kernel virtual PC value. Should I tell the debugger to use correct offset using add-symbol-file again? if so how?
ADD (8:32 AM Wednesday, January 12, 2022, UTC)
I found from gdb manual,
"add-symbol-file filename [ -readnow | -readnever ] [ -o offset ] [
textaddress ] [ -s section address ... ] The add-symbol-file command
reads additional symbol table information from the file filename. You
would use this command when filename has been dynamically loaded (by
some other means) into the program that is running. The textaddress
parameter gives the memory address at which the file's text section
has been loaded. You can additionally specify the base address of
other sections using an arbitrary number of '-s section address'
pairs. If a section is omitted, gdb will use its default addresses as
found in filename. Any address or textaddress can be given as an
expression. ..."
I changed my program a little bit to fix a problem. The readelf shows the .text section starting at ffffffc010080800.
So I adjusted the command to "add-symbol-file vmlinux 0x80000800" and gdb shows the kernel source correct after jump to linux.
Still it doesn't show me the source code after __primary_switched.
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .head.text PROGBITS ffffffc010080000 00010000
0000000000000040 0000000000000000 AX 0 0 4
[ 2] .text PROGBITS ffffffc010080800 00010800
0000000000304370 0000000000000000 AX 0 0 2048
[ 3] .rodata PROGBITS ffffffc010390000 00320000
.... (skip) ...
[12] .notes NOTE ffffffc01045be18 003ebe18
000000000000003c 0000000000000000 A 0 0 4
[13] .init.text PROGBITS ffffffc010470000 003f0000
0000000000027ec8 0000000000000000 AX 0 0 4
[14] .exit.text PROGBITS ffffffc010497ec8 00417ec8
000000000000046c 0000000000000000 AX 0 0 4
Since '__primary_switched' resides in section .init.text, I tried adding "-s .init.text 0xffffffc010470000" or "-s .init_text 0x803ef800"(physcial
address) to the add-symbol-file command to no avail. Is my command wrong? Or could this be from page table (virtual -> Physical) problem because I see synchronous exception right after I enter __primary_switched (I see PC value has become 0x200. If the exception vector is located in 0x0, this is the vector entry for synch exception like undefined instruction. I should also check the vector base address has not been set correctly.)
I found my kernel load address was wrong (__PHYS_OFFSET was below physical ddr address start).
After fixing it, the PC increments normally with kernel virtual address and I should just apply the add-symbol-file command using the virtual address.
This was the new section addresses.
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .head.text PROGBITS ffffffc010080000 00010000
0000000000000040 0000000000000000 AX 0 0 4
[ 2] .text PROGBITS ffffffc010080800 00010800
0000000000304370 0000000000000000 AX 0 0 2048
[ 3] .rodata PROGBITS ffffffc010390000 00320000
00000000000a6385 0000000000000000 WA 0 0 4096
[ 4] .modinfo PROGBITS ffffffc010436385 003c6385
00000000000018ff 0000000000000000 A 0 0 1
[ 5] .pci_fixup PROGBITS ffffffc010437c90 003c7c90
00000000000020f0 0000000000000000 A 0 0 16
[ 6] __ksymtab PROGBITS ffffffc010439d80 003c9d80
0000000000006d20 0000000000000000 A 0 0 4
[ 7] __ksymtab_gpl PROGBITS ffffffc010440aa0 003d0aa0
0000000000005808 0000000000000000 A 0 0 4
[ 8] __ksymtab_strings PROGBITS ffffffc0104462a8 003d62a8
00000000000134f2 0000000000000000 A 0 0 1
[ 9] __param PROGBITS ffffffc0104597a0 003e97a0
0000000000000b68 0000000000000000 A 0 0 8
[10] __modver PROGBITS ffffffc01045a308 003ea308
0000000000000cf8 0000000000000000 A 0 0 8
[11] __ex_table PROGBITS ffffffc01045b000 003eb000
0000000000000e18 0000000000000000 A 0 0 8
[12] .notes NOTE ffffffc01045be18 003ebe18
000000000000003c 0000000000000000 A 0 0 4
[13] .init.text PROGBITS ffffffc010470000 003f0000
0000000000027ec8 0000000000000000 AX 0 0 4
[14] .exit.text PROGBITS ffffffc010497ec8 00417ec8
The final kernel image is loaded at 0x80080000. Then __PHYS_OFFSET becomes 0x80000000. (TEXT_OFFSET is 0x80000 by default). Now I can debug the kernel source before __primary_switch using this command.
add-symbol-file images/vmlinux 0x80080800 -s .head.text 0x80080000 -s .init.text 0x803f7800
And after the kernel entered __primary_switched (now kernel virtual address is used), I added this command to see the source and I can follow code using qemu and gdb step-by-step.
add-symbol-file images/vmlinux 0xffffffc010080800 -s .head.text 0xffffffc010080000 -s .init.text 0xffffffc010470000 Hope this helps someone later.
But after some days, I think I could just use add-symbol-file images/vmlinux 0xffffffc010080800 (applying all the section info).

Why am I getting an unexpected `0xcc` byte when loading nearby code bytes? Is it because of segment register %es?

I got some inconsistent result of instruction.
I don't know why this happens, so I suspect %es register is doing something weird, but I'm not sure.
Look at below code snippet.
08048400 <main>:
8048400: bf 10 84 04 08 mov $HERE,%edi
8048405: 26 8b 07 mov %es:(%edi),%eax # <----- Result 1
8048408: bf 00 84 04 08 mov $main,%edi
804840d: 26 8b 07 mov %es:(%edi),%eax # <----- Result 2
08048410 <HERE>:
8048410: 11 11 adc %edx,(%ecx)
8048412: 11 11 adc %edx,(%ecx)
Result 1:
%eax : 0x11111111
Seeing this result, I guessed that mov %es:(%edi),%eax to be something like mov (%edi),%eax.
Because 0x11111111 is stored at HERE.
Result 2:
%eax : 0x048410cc
However, the result of Result 2 was quite different.
I assumed %eax to be 0x048410bf, because this value is stored at main.
But the result was different as you can see.
Question:
Why this inconsistency of the result happens?
By the way, value of %es was always 0x7b during execution of both instruction.
es is a red herring. The difference you see is 1 byte at main, cc vs. bf. That is because you used a software breakpoint at main and your debugger inserted an int3 instruction which has machine code cc temporarily overwriting your actual code.
Do not set a breakpoint where you intend to read from, or use a hardware breakpoint instead which does not modify code.

power8 assembly code with shared build issue with save and restore of TOC

I have the following assembly code
.machine power8
.abiversion 2
.section ".toc","aw"
.section .text
GLOBAL(myfunc)
myfunc:
stdu 1,-240(1)
mflr 0
std 0, 0*8(1)
mfcr 8
std 8, 1*8(1)
std 2, 2*8(1)
# Save all non-volatile registers R14-R31
std 14, 4*8(1)
...
# Save all the non-volatile FPRs
...
stwu 1, -48(1)
bl function_call
nop
addi 1, 1, 48
ld 0, 0*8(1)
mtlr 0
ld 8, 1*8(1)
ld 2, 2*8(1)
...
# epilogue, restore stack frame
This works fine with static build but shared build gives segmentation fault in
00000157.plt_call.__tls_get_addr_opt##GLIBC_2.22, should the shared build be handled differently in power8 w.r.t TOC?
The calling convention is the same between POWER 8 and previous processors. However, there has been changes with regards to the TOC pointer (r2) handling between ABIv1 and ABIv2.
In ABIv2, the caller does not establish the TOC pointer in r2; the called function should do this for global entry points (ie, where the TOC pointer may not be the same as that used in the callee). To do this, ABIv2 functions will have a prologue that sets r2:
0000000000000000 <foo>:
0: 00 00 4c 3c addis r2,r12,0
4: 00 00 42 38 addi r2,r2,0
- this depends on r12 containing the address of the function's global entry point (those 0 values will be replaced with actual offsets at final link time).
I don't see any code setting r12 appropriately in your example. Are you sure you're complying with the v2 ABI there?
The ABIv2 spec is available here: https://members.openpowerfoundation.org/document/dl/576 Section 2.3.2 will be the most relevant for this issue.

Reliable Linux kernel timestamps (or adjustment thereof) with both usbmon and ftrace?

I'm trying to inspect a kernel module that utilizes usb, and so from the module itself I'm writing a message to ftrace using trace_printk; and then I wanted to inspect when does a USB Bulk Out URB Submit appear in the system after that.
The problem is that on my Ubuntu Lucid 11.04 (kernel 2.6.38-16), there are only local and global clocks in ftrace - and although their resolution is the same (microseconds) as the timestamps by usbmon, their values differ significantly.
So not knowing any better (as I couldn't find anywhere else talking about this), what I did was attempt to redirect usbmon to trace_marker, using cat:
# ... activate ftrace here ...
usbpid=$(sudo bash -c 'cat /sys/kernel/debug/usb/usbmon/2u > /sys/kernel/debug/tracing/trace_marker & echo $!')
sleep 3 # do test, etc.
sudo kill $usbpid
# ... deactivate ftrace here...
... and then, when I read from /sys/kernel/debug/tracing/trace, I get a log with problematic timestamps (see below). So what I'd like to know is:
Is there a way to make usbmon have it's messages appear directly in /debug/tracing/trace, instead of in /debug/usb/usbmon/2u ? (not that I can see, but I'd like to have this confirmed)
If not, is there a better way to "directly" redirect output of /sys/kernel/debug/usb/usbmon/2u without any possible overhead/buffering issues of cat and/or shell redirection?
If not, is there some sort of an algorithm, where I could use the extra usbmon timestamp, to somehow "correct" the position of these events in the kernel timestamp domain? (see example below)
Here is a brief example snippet of a /sys/kernel/debug/tracing/trace log I got:
<idle>-0 [000] 44989.403572: my_kernel_function: 1 00 00 64 1 64 5
<...>-29765 [000] 44989.403918: my_kernel_function: 1 00 00 64 2 128 2
<...>-29787 [000] 44989.404202: 0: f1f47280 3237249791 S Bo:2:002:2 -115 64 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<...>-29787 [000] 44989.404234: 0: f1f47080 3237250139 S Bo:2:002:2 -115 64 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<idle>-0 [000] 44989.404358: my_kernel_function: 1 00 00 64 3 192 4
<...>-29787 [000] 44989.404402: 0: f1f47c00 3237250515 S Bo:2:002:2 -115 64 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
So when the kernel timestamp is 44989.404202, the usbmon timestamp is 3237.249791 (= 3237249791/1e6); neither the seconds nor the microseconds part match. To make it a bit easier on the eyes, here's the same snippet with only time information remaining:
(1) 44989.403572 MYF 0
(2) 44989.403918 MYF 0.000346
(3) 44989.404202 USB | 0 3237.249791 0
(4) 44989.404234 USB | 0.000032 3237.250139 0.000348
(5) 44989.404358 MYF 0.000440 | |
(6) 44989.404402 USB 0.000168 3237.250515 0.000376
So judging by the kernel timestamps, 32 μs expired between event (3) and event (4) - but judging by the usbmon timestamps, 348 μs expired between the same events! Whom to trust now?!
Now, if we assume that the usbmon timestamps are more correct for those messages, given they got "printed" before they ended up in the ftrace buffer to begin with - we could assume that the first usb message (3) may have been scheduled right after (1) executed, but something preempted it - and so the second USB message (4) triggered the "printout" (or rather, "entry") of both (3) and (4) in the ftrace buffer (which is why their kernel timestamps are so clcse together?)
So, if I assume (4) is the more correct one, I can try push (3) back for 348 μs:
(1) 44989.403572 MYF 0
(3) 44989.403854 USB | 0 3237.249791 0
(2) 44989.403918 MYF 0.000346 | |
(4) 44989.404234 USB | 0.000380 3237.250139 0.000348
(5) 44989.404358 MYF 0.000440 | |
(6) 44989.404402 USB 0.000168 3237.250515 0.000376
... and that sort of looks better (also USB now fires 282 μs, 316 μs, and 44 μs after MYF) - for first and second MYF/USB pairs (if that is indeed how they behave); but then the third step doesn't really match, and so on... Cannot really think of an algorithm to help me adjust the USB events position according to the data in the usbmon timestamp...
While the best approach for redirecting usbmon output to ftrace is still an open question, I got an answer about correlating their timestamps from this thread:
Using both usbmon and ftrace? [linux-usb mailing list]
You can call the following
subroutine to get a usbmon-style timestamp value, which can then be
added to an ftrace message or simply printed in the kernel log:
#include <linux/time.h>
static unsigned usbmon_timestamp(void)
{
struct timeval tval;
unsigned stamp;
do_gettimeofday(&tval);
stamp = tval.tv_sec & 0xFFF;
stamp = stamp * 1000000 + tval.tv_usec;
return stamp;
}
For example,
pr_info("The usbmon time is: %u\n", usbmon_timestamp());

Resources