Perf stat equivalent for Mac OS? - macos

Is there a perf stat equivalent on Mac OS? I would like to do the same thing for a CLI command and googling is not yielding anything.

There is the Instruments tool in Mac OS X for profiling applications, including with the hardware PMU. By default it runs a sampling profiler for CPU usage. Some docs: https://en.wikipedia.org/wiki/Instruments_(software) https://help.apple.com/instruments/mac/current/
It also has a command-line variant: https://help.apple.com/instruments/mac/current/#/devb14ffaa5
Open Terminal (in /Applications/Utilities) and run:
instruments -t "Allocations" -D ~/Desktop/YourTraceFileName.trace PathToYourApp
The page https://gist.github.com/loderunner/36724cc9ee8db66db305 mentions the sample tool ("included in a standard Mac OS X installation").
The Shark tool is also mentioned for older versions of Mac OS X (before 10.7) and Xcode: https://en.wikipedia.org/wiki/Apple_Developer_Tools#Shark
With an Intel CPU you can try the Intel VTune profiler: https://software.intel.com/en-us/get-started-with-vtune-macos https://software.intel.com/en-us/vtune
Another, more open Intel tool (partially deprecated?) is https://github.com/opcm/pcm/ which has some kind of OS X support. Docs: https://software.intel.com/en-us/articles/intel-performance-counter-monitor. It requires the custom MacMSRDriver driver (kext).
perf stat counts events, and I'm not sure how to collect counters with Instruments. The page https://www.robertpieta.com/counters-in-instruments/ shows how to configure the Instruments GUI for event counting:
To configure Counters, select File -> Recording Options from the Instruments navigation menu.
For the purposes of this post, sampling by Time will be selected. Using the + you are able to add specific events that Counters can count available on the particular CPU currently connected to Instruments.
So you can at least instruct Instruments to record counter values periodically over time. Some problems are reported for that mode: http://hmijailblog.blogspot.com/2015/09/using-intels-performance-counters-on-os.html

I was disappointed by the lack of a CLI equivalent to perf stat -r, so I just wrote up https://github.com/cdr/timer.
Works like:
$ timer -n 4 -q sleep 1s
--- config
command sleep 1s
iterations 4
parallelism 1
--- percentiles
0 (fastest) 1.004
25 (1st quantile) 1.004
50 (median) 1.006
75 (3rd quantile) 1.008
100th (slowest) 1.008
--- summary
mean 1.006
stddev 0.002
This doesn't contain advanced execution counters, just wall clock statistics.
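For readers who only need the wall-clock part, the same idea can be sketched in a few lines of Python. This is not how the timer tool above is implemented; the function names time_command and summarize are made up for this illustration, and the sketch only covers repeated runs plus basic statistics.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, iterations):
    """Run argv `iterations` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        # Output is discarded so that only the run time is measured.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return fastest, median, slowest, mean and sample standard deviation."""
    return {
        "fastest": min(durations),
        "median": statistics.median(durations),
        "slowest": max(durations),
        "mean": statistics.mean(durations),
        # stdev needs at least two samples.
        "stddev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Placeholder command: the Python interpreter doing nothing.
    # Replace it with the command you want to measure, e.g. ["sleep", "1"].
    runs = time_command([sys.executable, "-c", "pass"], iterations=4)
    for name, value in summarize(runs).items():
        print(f"{name:8s} {value:.3f}")
```

Like the tool above, this reports wall-clock time only; it does not read any hardware counters.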

Related

What can be done to lower UE4Editor startup time?

Status: the problem has lessened, but compared to other users' reports it persists.
I have moved to UE4.27.0 and the startup time dropped from 11 minutes (v4.26.2) to 6 minutes! (RAM usage dropped too!) But that doesn't compare to the "almost instant" startup other people report...
It is not compiling anything, not even shaders; this is about the 6th time I have run it for the same project.
Should I try disabling plugins? I'm new to UE and don't want to make it harder to use. Though, for example, I have nothing VR-related to test, so that could really be disabled initially.
HD READ SPEED? NO
I tested moving the whole UE4Editor engine path (100GB) to a 3xSSD (striped) volume, but the UE4Editor startup time remained the same. The HD where it normally lives is fast too, but not as fast as the 3xSSD.
CPU USAGE? MAYBE. Could using all 4 cores solve it?
UE4Editor startup uses A SINGLE CORE ONLY. I can confirm this with htop and System Monitor: only one core is at 100% at any time, and the load moves between the 4 cores.
I tested the command-line parameter -USEALLAVAILABLECORES after the project URL for UE4Editor, but nothing changed. I read that this option is ignored on some machines, so maybe patching its usage would make it work on mine?
GPU? NO?
A report about a (weak) integrated graphics card says it doesn't affect the startup time.
LOG for UE4Editor v4.27.0 with the new biggest intervals ("..." marks omitted log lines, to make it easier to read; "!(interval in seconds)" marks the gap between two adjacent lines, with no lines omitted there):
[2021.09.15-23.38.20:677][ 0]LogHAL: Linux SourceCodeAccessSettings: NullSourceCodeAccessor
!22s
[2021.09.15-23.38.42:780][ 0]LogTcpMessaging: Initializing TcpMessaging bridge
[2021.09.15-23.38.42:782][ 0]LogUdpMessaging: Initializing bridge on interface 0.0.0.0:0 to multicast group 230.0.0.1:6666.
!16s
[2021.09.15-23.38.58:158][ 0]LogPython: Using Python 3.7.7
...
[2021.09.15-23.39.01:817][ 0]LogImageWrapper: Warning: PNG Warning: Duplicate iCCP chunk
!75s
[2021.09.15-23.40.16:951][ 0]SourceControl: Source control is disabled
...
[2021.09.15-23.40.26:867][ 0]LogAndroidPermission: UAndroidPermissionCallbackProxy::GetInstance
!16s
[2021.09.15-23.40.42:325][ 0]LogAudioCaptureCore: Display: No Audio Capture implementations found. Audio input will be silent.
...
[2021.09.15-23.41.08:207][ 0]LogInit: Transaction tracking system initialized
!9s
[2021.09.15-23.41.17:513][ 0]BlueprintLog: New page: Editor Load
!23s
[2021.09.15-23.41.40:396][ 0]LocalizationService: Localization service is disabled
...
[2021.09.15-23.41.45:457][ 0]MemoryProfiler: OnSessionChanged
!13s
[2021.09.15-23.41.58:497][ 0]LogCook: Display: CookSettings for Memory: MemoryMaxUsedVirtual 0MiB, MemoryMaxUsedPhysical 16384MiB, MemoryMinFreeVirtual 0MiB, MemoryMinFreePhysical 1024MiB
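The intervals marked by hand above could also be found automatically. A minimal Python sketch, assuming every relevant log line starts with a timestamp in the [YYYY.MM.DD-HH.MM.SS:mmm] form shown in the log; the helper names parse_stamp and largest_gaps are made up for this illustration:

```python
import re
from datetime import datetime, timedelta

# Matches the leading "[2021.09.15-23.38.20:677]" part of a log line.
STAMP = re.compile(r"^\[(\d{4}\.\d{2}\.\d{2}-\d{2}\.\d{2}\.\d{2}):(\d{3})\]")


def parse_stamp(line):
    """Return the datetime at the start of a log line, or None if absent."""
    match = STAMP.match(line)
    if match is None:
        return None
    base = datetime.strptime(match.group(1), "%Y.%m.%d-%H.%M.%S")
    return base + timedelta(milliseconds=int(match.group(2)))


def largest_gaps(lines, top=5):
    """Return (gap_seconds, line_before, line_after) for the longest pauses."""
    stamped = [(parse_stamp(line), line.rstrip()) for line in lines]
    stamped = [(t, line) for t, line in stamped if t is not None]
    gaps = [
        ((t2 - t1).total_seconds(), l1, l2)
        for (t1, l1), (t2, l2) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]


if __name__ == "__main__":
    # Three lines copied from the log above; a real run would read the log file.
    sample = [
        "[2021.09.15-23.38.20:677][ 0]LogHAL: Linux SourceCodeAccessSettings: NullSourceCodeAccessor",
        "[2021.09.15-23.38.42:780][ 0]LogTcpMessaging: Initializing TcpMessaging bridge",
        "[2021.09.15-23.38.42:782][ 0]LogUdpMessaging: Initializing bridge on interface 0.0.0.0:0 to multicast group 230.0.0.1:6666.",
    ]
    for seconds, before, after in largest_gaps(sample, top=1):
        print(f"{seconds:.3f}s between:\n  {before}\n  {after}")
```

Each reported gap only says that nothing was logged between the two lines; which subsystem was actually busy during that time still has to be worked out from the lines around it.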
SPECS:
I'm using Ubuntu 20.04.
My CPU has 4 cores at 3.6GHz.
GeForce GT 710 1GB.
Related question but for older UE4: https://answers.unrealengine.com/questions/987852/view.html
Unreal Engine needs a high-end PC with a lot of RAM, fast SSDs, a good CPU and a mid-range graphics card. First of all, there are always some engine shaders that need to be compiled, and a lot of assets to be loaded at startup. Since you're on Linux, you are probably using a self-compiled Unreal Engine version... not the best thing to do for a newbie, because it may cause problems with load time, startup, compiling and a lot of other things. If this is your first time using Unreal, try it on Windows, where it's all easier.

Which tool for finding the reason for latency peaks on embedded Linux?

Ideally, I would have a (debug) tool that runs in the background and tells me the name of the process or driver that breaks my system's latency requirement. Which tool is suitable? Do you have a short example of its usage for the following case?
Test case:
The oscilloscope measures the time between the trigger of a GPIO input and the response on a GPIO output. Usually the response time is 150µs. I trigger every 25ms.
My Linux user-space test program uses poll() and read()+write() to mirror the detected input signal back to an output as the response.
The Linux kernel is patched with the PREEMPT_RT patch.
Over a span of hours, I see response time peaks of up to 20ms.
Your best chance is to switch on tracing in the kernel configuration and build a Linux kernel with it:
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_STACK_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_DEBUG_FS=y
Then run your application under trace-cmd until the weird things happen:
trace-cmd start -b 10000 -e 'sched_wakeup*' -e sched_switch -e gpio_value -e irq_handler_entry -e irq_handler_exit /tmp/myUserApplication
Then stop the trace and extract it to get a trace.dat file:
trace-cmd stop
trace-cmd extract
Load that trace.dat file in KernelShark and analyse the CPUs, threads, interrupts, kworker threads and user space threads. This makes it easy to see what is blocking the system.

How to print kernel call stack in Mac OS X

In Linux, I can use echo t > /proc/sysrq-trigger to dump the kernel call stacks of all threads in the system.
Is there any method in Mac OS X for the same purpose, or any method to dump the kernel stack of one process?
Short answer: procexp 0 threads (as root) will do the trick, where procexp is "Process Explorer" from http://newosxbook.com/tools/procexp.html .
Slightly Longer answer:
- DTrace is overkill and requires disabling SIP
- stackshot is deprecated since its underlying syscall (#365) was removed
- A replacement, stack_snapshot_with_config (#491), can be used programmatically as well (this is what drives the above tool)
The answer is probably dtrace. I know Instruments.app (or iprofiler) can do probe based profiling, so it takes periodic stack traces. (user or kernel; your choice) As far as I'm aware this is all based on dtrace, although I don't know it well enough to be able to tell you a way to take a one-off trace.
Hmm... I haven't coded on Mac OS X for several years. But a tool named 'stackshot' can help you do this. Try googling it for the usage. :-)
From http://www.brendangregg.com/DTrace/DTrace-cheatsheet.pdf:
sudo dtrace -n 'fbt:::entry { stack(10); ustack(5) }'
prints 10 kernel frames, 5 userland frames

QEMU-KVM and Perf Statistics

I have some VMs running on an IBM Power8 using QEMU-KVM and I want to get statistics about LLC misses.
How can I do that in order to get statistics for each VM separately?
Do you want this data for the whole VM or for one application running on a VM?
I tested it on an Ubuntu 15.04 image under QEMU-KVM, and I am able to get it using perf. In this case, I am getting the LLC stats for a gzip operation. Take a look:
$ perf stat -e LLC-loads,LLC-load-misses gzip -9 /tmp/vmlinux
Performance counter stats for 'gzip -9 /tmp/vmlinux':
263,653 LLC-loads
10,753 LLC-load-misses # 4.08% of all LL-cache hits
4.006553608 seconds time elapsed
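The 4.08% that perf prints is simply LLC-load-misses divided by LLC-loads. If the counters need to be post-processed, for example to compare several runs, the ratio can be recomputed from captured output. A minimal Python sketch, assuming the default human-readable perf stat layout shown above (perf stat normally prints it on stderr); parse_counters and llc_miss_ratio are names made up for this illustration:

```python
import re

# A counter line looks like "263,653 LLC-loads" (count, then event name).
COUNTER = re.compile(r"^\s*([\d,]+)\s+([\w.-]+)")


def parse_counters(perf_output):
    """Map event name -> integer count for each counter line in the output."""
    counters = {}
    for line in perf_output.splitlines():
        match = COUNTER.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


def llc_miss_ratio(counters):
    """Fraction of last-level cache loads that missed."""
    return counters["LLC-load-misses"] / counters["LLC-loads"]


if __name__ == "__main__":
    # Output copied from the run above.
    sample = """\
Performance counter stats for 'gzip -9 /tmp/vmlinux':
263,653 LLC-loads
10,753 LLC-load-misses # 4.08% of all LL-cache hits
4.006553608 seconds time elapsed
"""
    counters = parse_counters(sample)
    print(f"LLC miss ratio: {llc_miss_ratio(counters):.2%}")  # 10,753 / 263,653
```

The "seconds time elapsed" line is skipped because its number contains a decimal point, so only the integer counter lines are picked up.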
For more detailed/explanatory content about some POWER events, refer to these documents:
Comprehensive PMU Event Reference – POWER7
Commonly Used Metrics for Performance Analysis – POWER7
The former is more of a reference, and the latter is more of a tutorial (including a section about the cache/memory hierarchy with hits/misses).
Those should be listed in: https://www.power.org/events/Power7

Event-based sampling with the perf userland tool and PEBS

I'm doing event-based sampling with the perf userland tool. The objective is to find out where certain performance-impacting events, like branch misses and cache misses, are occurring in a larger system I'm working on.
Now, something like
perf record -a -e branch-misses:pp -- sleep 5
works perfectly: the PEBS counting mode triggered by the 'pp' modifier is really accurate when collecting the IP in the samples.
Unfortunately, when I try to do the same for cache-misses, i.e.
perf record -a -e cache-misses:pp -- sleep 5 # [1]
I get
Error: sys_perf_event_open() syscall returned with 22 (Invalid argument). /bin/dmesg may provide additional information.
Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
dmesg | grep "perf\|pmu" shows nothing useful AFAICT. I'm also pretty sure that the kernel was compiled with CONFIG_PERF_EVENTS=y, because both the branch-misses:pp command above and
perf record -a -e cache-misses -- sleep 5 # [2]
work; the problem with [2] is that the collected samples are not very accurate, which hurts my profiles.
Any hints on what could be going on here?
It turns out the specific event that the generic cache-misses maps to does not support PEBS. An alternative is to use one of the events that are supported by PEBS (see the list for the Nehalem architecture here) with an appropriate mask to narrow it down. Specifically, one could use MEM_LOAD_RETIRED:LLC_MISS, even though the event doesn't seem to be accurate on all occasions.

Resources