QEMU-KVM and Perf Statistics

I have some VMs running on an IBM POWER8 machine under QEMU-KVM, and I want to gather statistics about LLC misses.
How can I do that so that I get statistics for each VM separately?

Do you want this data for the whole VM, or for one application running inside a VM?
I tested it on an Ubuntu 15.04 image running over QEMU-KVM, and I am able to get the statistics using perf. In this case, I am getting the LLC stats for a gzip operation. Take a look:
$ perf stat -e LLC-loads,LLC-load-misses gzip -9 /tmp/vmlinux
Performance counter stats for 'gzip -9 /tmp/vmlinux':
263,653 LLC-loads
10,753 LLC-load-misses # 4.08% of all LL-cache hits
4.006553608 seconds time elapsed
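To scope the counters to one VM rather than one command, perf can also attach to a running process; a minimal sketch, assuming each VM runs as a single QEMU process (fill in the PID placeholder from the pgrep output):
pgrep -a qemu    # find the QEMU process of each VM
perf stat -e LLC-loads,LLC-load-misses -p <qemu-pid> -- sleep 10    # measure that VM for 10 seconds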

For more detailed/explanatory content about some POWER events, refer to these documents:
Comprehensive PMU Event Reference – POWER7
Commonly Used Metrics for Performance Analysis – POWER7
The former is more of a reference, and the latter is more of a tutorial (including a section about the cache/memory hierarchy with hits/misses).
Both should be listed at: https://www.power.org/events/Power7

Which tool for finding the reason for latency peaks on embedded Linux?

The best case would be if I had a (debug) tool which runs in the background and tells me the name of the process or driver that breaks the latency requirement of my system. Which tool is suitable? Do you have a short example of its usage for the following case?
Test case:
An oscilloscope measures the time between the trigger on a GPIO input and the response on a GPIO output. Usually the response time is 150 µs; I trigger every 25 ms.
My Linux user-space test program uses poll() and read()+write() to mirror the signal detected on the input back to an output as the response.
The Linux kernel is patched with the PREEMPT_RT patch.
Over a timescale of hours I see response-time peaks of up to 20 ms.
Your best bet is to switch on tracing in the kernel configuration and build a kernel with:
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_STACK_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_DEBUG_FS=y
then run your application until the problem occurs, using the trace-cmd tool:
trace-cmd start -b 10000 -e 'sched_wakeup*' -e sched_switch -e gpio_value -e irq_handler_entry -e irq_handler_exit /tmp/myUserApplication
Once the latency peak has occurred, stop tracing and extract the trace.dat file:
trace-cmd stop
trace-cmd extract
Load that trace.dat file in KernelShark and analyse the CPUs, threads, interrupts, kworker threads, and user-space threads. It's a great way to see what is blocking the system.
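Assuming trace-cmd and KernelShark are installed, the trace can be opened graphically or dumped as text:
kernelshark trace.dat
trace-cmd report -i trace.dat | less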

Docker volume with Grunt file watch

I'm porting an existing project with Grunt file watches to a Docker development container. The source files are bind-mounted into the container, and Grunt watches the files for changes (this can probably be optimized, but my current concern is simply getting the existing setup working within Docker).
On the Mac, I'm experiencing enormous CPU usage, so I read the performance tuning guide for osxfs. The guide mentions the cached and delegated volume modes.
The description for delegated says:
the container’s view is authoritative
(permit delays before updates on the container appear in the host)
For cached:
[…] provides all the guarantees of the delegated configuration, and some additional guarantees around the visibility of writes performed by containers. As such, cached typically improves the performance of read-heavy workloads, at the cost of some temporary inconsistency between the host and the container.
In comparison to which setting does cached improve performance? Is "read-heavy workloads" seen from the container's perspective?
To cut a long story short: What's the optimal setting to reduce CPU usage for a development environment which uses file watches? cached or delegated?
OK, so I did some testing and here are my results. Setup:
MacBook Air 11", early 2014
macOS 10.12.6
Docker 17.06.0-ce-mac19 (18663)
watch task polling for ~ 1,000 files
The culprit processes eating up CPU cycles in the host are hyperkit and com.docker.osxfs. The following percentage values are the median CPU usage taken over five samples:
delegated: 18.7 % hyperkit + 0.0 % com.docker.osxfs = 18.7 %
cached: 24.3 % hyperkit + 0.1 % com.docker.osxfs = 24.4 %
default (consistent): 152.0 % hyperkit + 68.9 % com.docker.osxfs = 220.9 % (!)
Functionality-wise I didn't notice any difference. When changing a file outside the container the changes were picked up virtually immediately by the watch in each of the three cases. So I'm going to use the delegated mode now.
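For reference, the mode is set per bind mount with a suffix on the -v flag; a minimal sketch, where the image name, paths, and command are placeholders for your own setup:
docker run -v "$PWD:/usr/src/app:delegated" my-dev-image grunt watch    # 'my-dev-image' is hypothetical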

Greenplum gp_vmem_protect_limit configuration

We are doing a PoC by installing Greenplum in an AWS environment. We have set up each of our segment servers as d2.8xlarge instances, which have 240 GB of RAM and no swap.
I am now trying to set gp_vmem_protect_limit using the formula mentioned in the GPDB documents, and the value comes out to 25600MB.
But one of the Zendesk notes says that gp_vmem_protect_limit will be breached when "sessions executing on this segment are attempting together to use more than configured limit." Does "segment" in this text mean the segment host, or each primary segment?
Also, with the Eager Free option set, I see that memory utilization is very poor when running the TPC-DS benchmark with 5 concurrent users. I would like to improve the memory utilization of the environment; the other memory settings are below:
gpconfig -c gp_vmem_protect_limit -v 25600MB
gpconfig -c max_statement_mem -v 16384MB
gpconfig -c statement_mem -v 2400MB
Any suggestions?
Thanks,
Jayadeep
There is a calculator for it!
http://greenplum.org/calc/
You should also add a swap file or disk. It is pretty easy to do on Amazon, too. I would add at least a 4 GB swap file to each host when you have 240 GB of RAM.
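A minimal sketch of adding a 4 GB swap file on each segment host (standard Linux commands; adjust the size and path as needed):
sudo fallocate -l 4G /swapfile    # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab    # persist across reboots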

Event-based sampling with the perf userland tool and PEBS

I'm doing event-based sampling with the perf userland tool: the objective is to find out where certain performance-impacting events, like branch misses and cache misses, are occurring on a larger system I'm working on.
Now, something like
perf record -a -e branch-misses:pp -- sleep 5
works perfectly: the PEBS sampling mode triggered by the 'pp' modifier is really accurate when collecting the IP in the samples.
Unfortunately, when I try to do the same for cache-misses, i.e.
perf record -a -e cache-misses:pp -- sleep 5 # [1]
I get
Error: sys_perf_event_open() syscall returned with 22 (Invalid argument). /bin/dmesg may provide additional information.
Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
dmesg | grep "perf\|pmu" shows nothing useful AFAICT. I'm also pretty sure that the kernel was compiled with CONFIG_PERF_EVENTS=y because both [1] and
perf record -a -e cache-misses -- sleep 5 # [2]
work: the problem with [2] is that the collected samples are not very accurate, which hurts my profiles.
Any hints on what could be going on here?
It turns out the specific event that the generic cache-misses maps to does not support PEBS. An alternative is to use one of the events that are supported by PEBS (see the list for the Nehalem architecture here) with an appropriate mask to narrow it down. Specifically, one could use MEM_LOAD_RETIRED:LLC_MISS, even though the event doesn't seem to be accurate on all occasions.
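As a sketch of that workaround (the exact symbolic event name varies by CPU and perf version, so treat mem_load_retired.llc_miss below as a placeholder and check perf list first):
perf list | grep -i llc    # find the PEBS-capable LLC events on this CPU
perf record -a -e mem_load_retired.llc_miss:pp -- sleep 5    # placeholder event name; adjust to your CPU
perf report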

Can ETW (event tracing for windows) be used to gather also memory statistics?

Is it possible, using ETW, to also get memory statistics of all the processes and the system?
By memory statistics I mean, e.g., committed bytes, private bytes, paged pool, working set, ...
I cannot find anything about using xperf to get and see memory statistics. It is always about CPU, disk, and network.
One could probably use performance counters to get that kind of information, but how can one overlay the statistics graphically in one chart (how to correlate/sync the timestamps) ?
Your best bet on Windows 8.1 and higher is the Microsoft-Windows-Kernel-Memory provider, which records per-process memory information every 0.5 s. See https://github.com/google/UIforETW/issues/80 for details. UIforETW enables this by default when it is available.
You could also try the MEMINFO provider. It gives a system-wide overview of memory pressure. It shows the Active List (currently in use memory), the Standby List ('useful' pages not currently in use, such as the disk cache), and the Zero and Free lists (genuinely free memory). This at least lets you tell whether a system is running out of memory.
You could also try MEMINFO_WS and CONTMEMGEN but these are undocumented so I really don't know what they do. They show up in xperf -providers k but when I record with them I can't see any new graphs appearing. Apparently Microsoft ships these providers but no way to view them. Sigh...
If you want more memory details on Windows 7 -- such as per-process working sets -- your best bet is to have a process running which periodically queries this data and emits it in custom ETW events. This is available in a prepackaged form in UIforETW which can query the working set of a specified set of processes once a second. See the announcement post for how to get UIforETW:
https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/
UIforETW's Windows 7 working set data shows up in Generic Events under Task Name == WorkingSet. On Windows 8.1 the OS working set data (more detailed, more efficiently recorded) shows up under Memory -> Virtual Memory Snapshots.
You can trace memory usage with the ReferenceSet kernel group. It includes the following traceflags:
PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+FOOTPRINT+VIRT_ALLOC+MEMINFO+VAMAP+SESSION+REFSET+MEMINFO_WS
MEMORY = Memory tracing
FOOTPRINT+REFSET = Support footprint analysis
MEMINFO = Memory list info (Active, Standby and the other lists you see in ResMon)
VIRT_ALLOC = Virtual allocation reserve and release
VAMAP = mapped files information
MEMINFO_WS = Working set Info
As you can see, xperf can capture a lot of memory data when you use the right flags.
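A minimal capture sketch using that kernel group (run from an elevated command prompt; the trace file name is arbitrary):
xperf -on ReferenceSet
rem ... reproduce the memory scenario here ...
xperf -d memtrace.etl
wpa memtrace.etl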
