Perf report's function at 0xffffffffffffffff

I am trying to work out what myApp is so busy with (90% CPU on a single thread). It is a server that I shouldn't restart. I've collected samples with:
perf record -p 5068 -F 99 --call-graph dwarf sleep 10
And perf report gives me this:
+ 100.00% 0.00% myApp [unknown] [.] 0xffffffffffffffff
+ 80.67% 0.67% myApp myApp [.] pipeline_run
+ 67.71% 0.00% myApp myApp [.] QueryProcessor::process
I spent some time googling and reading docs, and I suspect that 0xffffffffffffffff could not be resolved because perf does not know where the stack bottom is, since it didn't start the process. But could someone confirm this or point me in the right direction?

In my case this was caused by perf using too small a stack dump size. That causes the bottom of the truncated stack to become 0xffffffffffffffff, which makes perf report and friends think that the function 0xffffffffffffffff is used a lot.
When I used the maximum stack dump capture size, I got rid of most of the 0xffffffffffffffff entries. If you pass --call-graph dwarf, the default stack dump size of 8192 bytes is used. To maximize it, change to --call-graph dwarf,65528.
From perf record --help:
When "dwarf" recording is used, perf also records (user) stack dump
when sampled. Default size of the stack dump is 8192 (bytes).
User can change the size by passing the size after comma like
"--call-graph dwarf,4096".
If you try a value larger than 65528 you get:
callchain: Incorrect stack dump size (max 65528): 128000

Related

Recompile Linux Kernel not to use specific CPU register

I'm doing an experiment that writes the index of a loop into the CPU register R11, builds it with gcc -ffixed-r11 to tell the compiler not to use that register, and finally uses perf to measure it.
But when I check the report (using perf script), the R11 value of most record entries is not what I expected; it should be a number sequence like 1..2..3 or 1..4..7, etc. Instead it is just a few fixed values (possibly overwritten by system calls?).
How can I make perf record the value I set in the register in my program? Or must I recompile the whole kernel with -ffixed-r11 to achieve this?
Thanks everyone.
You should not recompile the kernel when you just want to sample a register with perf. As I understand it, the kernel has its own set of registers and will not overwrite the user-space R11. The syscall interface uses some fixed registers that can't be changed (can you try a different register?), and there are often glibc wrappers around syscalls that may use additional registers (they are user-space code, not kernel code, and are often generated or written in assembler). You can use gdb to watch the register and find out who changes it (see gdb: breakpoint when register will have value 0xffaa): run gdb ./program, then start; watch $r11; continue; where.
Two weeks ago there was a question, perf-report show value of CPU register, about register value sampling with perf:
I followed this document and used perf record with --intr-regs=ax,bx,r15, trying to log additional CPU register information with the PEBS record.
While that was x86 & PEBS, ARM may have --intr-regs implemented too. Check the output of perf record --intr-regs=\? (man perf-record: "To list the available registers use --intr-regs=\?") to find the support status and register names.
To print the registers, use the perf script -F ip,sym,iregs command. There is an example in the Linux kernel commits:
# perf record --intr-regs=AX,SP usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (8 samples) ]
# perf script -F ip,sym,iregs | tail -5
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff81761ac0 _raw_spin_lock AX:0xffff8801bfcf8020 SP:0xffff8802629c3ce8
ffffffff81202bf8 __vma_adjust_trans_huge AX:0x7ffc75200000 SP:0xffff8802629c3b30
ffffffff8122b089 dput AX:0x101 SP:0xffff8802629c3c78
#
If you need a cycle-accurate profile of to-the-metal CPU activity, then perf is not the right tool: it is at best an approximation, because it only samples the program at selected points. See this video on perf by Clang developer Chandler Carruth.
Instead, you should single-step through the program to monitor exactly what is happening to the registers. Or you could program your system bare-metal, without an OS, but that is probably outside the scope here.

systemd MemoryLimit not enforced

I am running systemd version 219.
root#EVOvPTX1_RE0-re0:/var/log# systemctl --version
systemd 219
+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP -GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID -ELFUTILS +KMOD -IDN
I have a service, let's call it foo.service which has the following.
[Service]
MemoryLimit=1G
I have deliberately added code that allocates 1M of memory 4096 times, i.e. a 4G
allocation, when a certain event is received. The idea is that after the process
consumes 1G of address space, memory allocation should start failing.
However, this does not seem to be the case: I am able to allocate 4G of memory
without any issues. This tells me that the memory limit specified in the
service file is not enforced.
Can anyone let me know what I am missing?
I looked at the proc file system, at the file named limits. It shows that the
max address space is unlimited, which also confirms that the memory limit
is not being enforced.
The distinction is that you have allocated memory, but you haven't actually used it. In the output of top, this is the difference between the "VIRT" memory column (allocated) and the "RES" column (actually used).
Try modifying your experiment to assign values to elements of a large array instead of just allocating memory and see if you hit the memory limit that way.
Reference: Resident and Virtual memory on Linux: A short example

The !address windbg command says that heap address is 'REGionUsageIsVAD' even though it was allocated using HeapAlloc

My heap buffer of interest was allocated as follows:
0:047> !heap -p -a 1d7cd1f0
address 1d7cd1f0 found in
_DPH_HEAP_ROOT # 5251000
in busy allocation ( DPH_HEAP_BLOCK: UserAddr UserSize - VirtAddr VirtSize)
1cf8f5b0: 1d7cc008 3ff8 - 1d7cb000 6000
68448e89 verifier!AVrfDebugPageHeapAllocate+0x00000229
76e465ee ntdll!RtlDebugAllocateHeap+0x00000030
76e0a793 ntdll!RtlpAllocateHeap+0x000000c4
76dd5dd0 ntdll!RtlAllocateHeap+0x0000023a
000ca342 TEST+0x0002a342
000be639 TEST+0x0001e639
As you can see, it was allocated using HeapAlloc(). When I run the !address command on this pointer I get:
ProcessParametrs 01699928 in range 01699000 0169a000
Environment 016976e8 in range 01697000 01698000
1d790000 : 1d7cb000 - 00005000
Type 00020000 MEM_PRIVATE
Protect 00000004 PAGE_READWRITE
State 00001000 MEM_COMMIT
Usage RegionUsageIsVAD
It claims to be RegionUsageIsVAD. According to this Stack Overflow answer, RegionUsageIsVAD generally means one of two things:
1. This is a .NET application, in which case the CLR allocates this block of memory.
2. The application calls VirtualAlloc to allocate a block of memory.
My scenario does not fit either of these cases. I confirmed that the CLR wasn't used by running .cordll -ve -u -l, which returned:
CLR DLL status: No load attempts
What does RegionUsageIsVAD mean in this case?
I reread your question, thinking I would update what I commented, but on closer look it seems there are a lot of holes; it appears you copied things and didn't paste them correctly.
Where is that pointer on the heap? (01699928?) Which version of windbg are you using?
Since I couldn't confirm it, I cooked up a simple program, enabled hpa (page heap) in gflags, and ran the exe under windbg; below is the screenshot.
Except for what you pasted as RegionUsageIsVAD (that line is output by the kernel-mode !address, not the user-mode !address), everything else appears similar in the screenshot.

Coredump size different than process virtual memory space

I'm working on OS X 10.11 and generated a dump file in the following manner:
1. ulimit -c unlimited
2. kill -10 5228 (process pid)
and got a dump file with the following attributes: 642M Jun 26 15:00 core.5228
Right before that, I checked the process's total memory space using the vmmap command to try to estimate the expected dump size.
However, the estimate (238.7M) was much smaller than the actual size (642M).
Can this gap be explained?
VIRTUAL REGION
REGION TYPE SIZE COUNT (non-coalesced)
=========== ======= =======
Activity Tracing 2048K 2
Kernel Alloc Once 4K 2
MALLOC guard page 16K 4
MALLOC metadata 180K 6
MALLOC_SMALL 56.0M 4 see MALLOC ZONE table below
MALLOC_SMALL (empty) 8192K 2 see MALLOC ZONE table below
MALLOC_TINY 8192K 3 see MALLOC ZONE table below
STACK GUARD 56.0M 2
Stack 8192K 2
__DATA 1512K 44
__LINKEDIT 90.9M 4
__TEXT 8336K 44
shared memory 12K 4
=========== ======= =======
TOTAL 238.7M 110
VIRTUAL ALLOCATION BYTES REGION
MALLOC ZONE SIZE COUNT ALLOCATED % FULL COUNT
=========== ======= ========= ========= ====== ======
DefaultMallocZone_0x100e42000 72.0M 7096 427K 0% 6
Core dumps can, and do, filter the process memory. See the core(5) man page:
Controlling which mappings are written to the core dump
Since kernel 2.6.23, the Linux-specific /proc/PID/coredump_filter file can be used to control which memory segments are written to the core dump file in the event that a core dump is performed for the process with the corresponding process ID.
The value in the file is a bit mask of memory mapping types (see mmap(2)). If a bit is set in the mask, then memory mappings of the corresponding type are dumped; otherwise they are not dumped. The bits in this file have the following meanings:
bit 0 Dump anonymous private mappings.
bit 1 Dump anonymous shared mappings.
bit 2 Dump file-backed private mappings.
bit 3 Dump file-backed shared mappings.
bit 4 (since Linux 2.6.24)
Dump ELF headers.
bit 5 (since Linux 2.6.28)
Dump private huge pages.
bit 6 (since Linux 2.6.28)
Dump shared huge pages.
bit 7 (since Linux 4.4)
Dump private DAX pages.
bit 8 (since Linux 4.4)
Dump shared DAX pages.
By default, the following bits are set: 0, 1, 4 (if the CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS kernel configuration option is enabled), and 5. This default can be modified at boot time using the coredump_filter boot option.
I assume OS X behaves similarly.

Investigating Memory Leak

We have a slow memory leak in our application, and I've already gone through the following steps in trying to analyze the cause of the leak:
Enabling user mode stack trace database in GFlags
In Windbg, typing the following command: !heap -stat -h 1250000 (where 1250000 is the address of the heap that has the leak)
After comparing multiple dumps, I see that memory blocks of size 0xC are increasing over time and are probably the memory being leaked.
typing the following command: !heap -flt s c
gives the UserPtr of those allocations and finally:
typing !heap -p -a address on some of those addresses always shows the following allocation call stack:
0:000> !heap -p -a 10576ef8
address 10576ef8 found in
_HEAP # 1250000
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
10576ed0 000a 0000 [03] 10576ef8 0000c - (busy)
mscoreei!CLRRuntimeInfoImpl::`vftable'
7c94b244 ntdll!RtlAllocateHeapSlowly+0x00000044
7c919c0c ntdll!RtlAllocateHeap+0x00000e64
603b14a4 mscoreei!UtilExecutionEngine::ClrHeapAlloc+0x00000014
603b14cb mscoreei!ClrHeapAlloc+0x00000023
603b14f7 mscoreei!ClrAllocInProcessHeapBootstrap+0x0000002e
603b1614 mscoreei!operator new[]+0x0000002b
603d402b +0x0000005f
603d5142 mscoreei!GetThunkUseState+0x00000025
603d6fe8 mscoreei!_CorDllMain+0x00000056
79015012 mscoree!ShellShim__CorDllMain+0x000000ad
7c90118a ntdll!LdrpCallInitRoutine+0x00000014
7c919a6d ntdll!LdrpInitializeThread+0x000000c0
7c9198e6 ntdll!_LdrpInitialize+0x00000219
7c90e457 ntdll!KiUserApcDispatcher+0x00000007
This looks like a thread initialization call stack, but I need to know more than this.
What is the next step you would recommend in order to put a finger on the exact cause of the leak?
The stack recorded when using GFlags is captured without utilizing .pdb symbols and is often not correct.
Since you have traced the leak down to a specific size on a given heap, you can try
to set a live breakpoint in RtlAllocateHeap and inspect the stack in windbg with proper symbols. I have used the following with some success; you must edit it to suit your heap and size.
$$ Display stack if heap handle eq 0x00310000 and size is 0x1303
$$ ====================================================================
bp ntdll!RtlAllocateHeap "j ((poi(@esp+4) == 0x00310000) & (poi(@esp+c) == 0x1303)) 'k'; 'gc'"
Maybe you will then get another stack, and other ideas about the offender.
The first thing is that the new operator is the new[] operator, so is there a corresponding delete[] call rather than a plain old delete call?
If you suspect this code, I would put a test harness around it; for instance, put it in a loop and execute it 100 or 1000 times, and see whether it still leaks, and proportionally.
You can also measure the memory increase using Process Explorer, or programmatically using GetProcessMemoryInfo.
The other obvious thing is to see what happens when you comment out this function call: does the memory leak go away? You may need to do a binary chop of the code, reducing the suspect code by roughly half each time by commenting it out. However, changing the behaviour of the code may cause more problems or dependent code-path issues, which can themselves cause memory leaks or strange behaviour.
EDIT
Ignore the following, seeing as you are working in a managed environment.
You might otherwise consider using the STL, or better yet Boost's reference-counted pointers such as shared_ptr (or scoped_array for array structures), to manage the lifetime of objects.
