Await time is high but %util is less for SSD disk - performance

For the following SSD disk, await time is high but %util is low.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-3 0.00 0.00 132.00 2272.00 892.00 100244.00 84.14 1111.70 658.36 5.00 696.32 0.20 49.20
%util is ((r/s + w/s) * svctm/1000) * 100, and it represents the percentage of time the device spent servicing requests. So ~50% util is not high.
On the other hand, the await time is pretty high. The fraction of await that requests actually spend waiting in the queue is (await - svctm)/await, i.e. (658.36 - 0.20) * 100 / 658.36, which is close to 100%. That means requests are spending most of their time waiting in the queue.
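As a quick cross-check with the numbers above: (132 + 2272) * 0.20 / 1000 * 100 ≈ 48%, which matches the reported 49.2% %util, while the queue-wait fraction (658.36 - 0.20) / 658.36 ≈ 99.97%.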
If %util is low but await is high, is the disk being utilized properly? Which of these two metrics is more reliable for SSDs?

Your output shows that the write wait time (w_await) is high. SSD writes generally tend to get slow over time as the drive fills up, and for several other reasons, such as:
Compatibility issues - check compatibility with your hardware
The partition was cloned from an HDD to the SSD
TRIM is not enabled (see the check after this list)
The drive was not initialized with zeros/ones
The drive sits behind an I/O controller, and the controller itself is slow
Check which of these scenarios fits your setup, perform a cleanup/initialization once, and see if that resolves the issue. For compatibility issues you might need to contact the vendor's support team.
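If you suspect the TRIM point above, a rough way to check (the device name and mount point are placeholders) is to look at the discard columns lsblk reports and to run a manual trim on a mounted filesystem:
lsblk --discard /dev/sdX
fstrim -v /mount/point
Non-zero DISC-GRAN/DISC-MAX values in the lsblk output mean the device advertises discard (TRIM) support.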

Related

Why does `native_write_msr` dominate my profiling result?

I have a program that runs on a multi-threaded framework, on Linux kernel 4.18 with an Intel CPU. I ran perf record -p pid -g -e cycles:u --call-graph lbr -F 99 -- sleep 20 to collect stack traces and generate a flame graph.
My program was running under a low workload, so the time spent in futex_wait is expected. But the top of the stack is the kernel function native_write_msr. According to What does native_write_msr in kernel do? and https://elixir.bootlin.com/linux/v4.18/source/arch/x86/include/asm/msr.h#L103, this function is used for performance counters. I have disabled the tracepoint in native_write_msr.
And pidstat -p pid 1 told me that the system CPU usage is quite low.
05:44:34 PM UID PID %usr %system %guest %CPU CPU Command
05:44:35 PM 1001 67441 60.00 4.00 0.00 64.00 11 my_profram
05:44:36 PM 1001 67441 58.00 7.00 0.00 65.00 11 my_profram
05:44:37 PM 1001 67441 61.00 3.00 0.00 64.00 11 my_profram
My questions are:
Why does native_write_msr appear so many times in the stack traces (as a result, it occupies about 80% of the flame graph)? Is it a blocking operation, or does it release the CPU when called?
Why is the system CPU usage relatively low compared to the flame graph? According to the graph, about 80% of the CPU time should be %system instead of %usr.
Any help is appreciated. If I missed any useful information, please comment.
Thank you very much!
From the flame graph, you can see that the native_write_msr function is called by schedule. When a running process is removed from a core (because it is migrated to another core or preempted by the scheduler to run another process), the scheduler needs to save the process's perf data and clean up its perf configuration, so that perf data from different processes does not get mixed up. The scheduler may need to write to an MSR in this step, thus calling native_write_msr. So native_write_msr is called so many times because scheduling or core migration happens very frequently.
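To confirm that hypothesis, something along these lines (the pid is a placeholder) will show how often the process is switched out or migrated:
perf stat -e context-switches,cpu-migrations -p <pid> -- sleep 10
pidstat -w -p <pid> 1
If the counts are high, pinning the process to a core (e.g. taskset -cp <core> <pid>) should reduce the native_write_msr samples in the profile.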

understanding iostat %utilization

I used the command below to test the maximum throughput the disk can achieve:
dd if=/dev/zero of=test bs=4k count=25000 conv=fdatasync
With multiple runs it averaged out to about 130 MB/s.
Now, when running Cassandra on this system, I am monitoring disk usage using
iostat -dmxt 30 sdd sdb sdc
There are certain entries I want to make sure I am interpreting correctly, like the one below.
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdc 0.00 2718.60 186.30 27.20 17.87 12.06 287.13 44.98 215.06 2.79 59.58
Even though the sum of rMB/s + wMB/s should roughly correspond to %util (relative to the disk throughput of 130 MB/s), and I am assuming some of the utilization goes towards seeks, can the difference be big enough to account for about 24% of the utilization?
Thanks in advance for any help.
The frequent spins/seeks do take a significant amount of (latency) time. In my tests, the I/O bandwidth difference between sequential I/O and random I/O is about 3x. Also, it is better to use fio (https://github.com/axboe/fio) to run this type of test, e.g. direct I/O, sequential read/write with a proper block size (256 KB or 512 KB, depending on what the controller supports), libaio as the I/O engine, and an I/O queue depth of 64. The test will be much better controlled.
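For example, a sequential-write run along the lines suggested above might look like this (file name and size are placeholders):
fio --name=seqwrite --filename=/path/to/testfile --rw=write --bs=256k --size=4g --direct=1 --ioengine=libaio --iodepth=64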

What can be the reason for CPU load to NOT scale linearly with the number of traffic processing workers?

We are writing a Front End that is supposed to process a large volume of traffic (in our case it is Diameter traffic, but that may be irrelevant to the question). As a client connects, the server socket gets assigned to one of the Worker processes that perform all the actual traffic processing. In other words, a Worker does all the work, and more Workers should be added when more clients get connected.
One would expect the CPU load per message to be the same for different numbers of Workers, because Workers are totally independent and serve different sets of client connections. Yet our tests show that it takes more CPU time per message as the number of Workers grows.
To be more precise, the CPU load depends on the TPS (Transactions or Request-Responses per second) as follows.
For 1 Worker:
60K TPS - 16%, 65K TPS - 17%... i.e. ~0.26% CPU per KTPS
For 2 Workers:
80K TPS - 29%, 85K TPS - 30%... i.e. ~0.35% CPU per KTPS
For 4 Workers:
85K TPS - 33%, 90K TPS - 37%... i.e. ~0.41% CPU per KTPS
What is the explanation for this? Workers are independent processes and there is no inter-process communication between them. Also each Worker is single-threaded.
The programming language is C++
This effect is observed on any hardware close to this one: 2 Intel Xeon CPUs, 4-6 cores each, 2-3 GHz
OS: RedHat Linux (RHEL) 5.8, 6.4
CPU load measurements are done using mpstat and top.
If either the size of the program code used by a worker or the size of the data processed by a worker (or both) is not small, the reason could be the reduced effectiveness of the various caches: the locality over time with which a single worker accesses its program code and/or its data is disturbed by other workers intervening.
The effect can be quite complicated to understand, because:
it depends massively on the structure of your code's computations,
modern CPUs have about three levels of cache,
each cache has a different size,
some caches are local to one core, others are not,
how often the workers intervene depends on your operating system's scheduling strategy
which gets even more complicated if there are multiple cores,
unless your programming language's run-time system also intervenes,
in which case it is more complicated still,
your network interface is a computer of its own and has a cache, too,
and probably more.
Caveat: Given the relatively coarse granularity of process scheduling, the effect of this ought not to be as large as it is, I think.
But then: Have you looked up how "percent of CPU" is even defined?
Until you reach CPU saturation on your machine you cannot be sure that the effect is actually as large as it looks. And when you do reach saturation, it may not be the CPU at all that is the bottleneck here, so are you sure you need to care about CPU load?
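One way to check for saturation is to look at per-core and per-worker utilization rather than the machine-wide average (the pids are placeholders):
mpstat -P ALL 1
pidstat -u -p <pid1>,<pid2> 1
A single core or worker sitting near 100% is a very different situation from the same load spread evenly across cores.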
I completely agree with @Lutz Prechelt. Here I just want to add a method for investigating the issue, and the answer is perf.
Perf is a performance analyzing tool in Linux which collects both kernel and userspace events and provides some nice metrics. It has been widely used in my team to find bottlenecks in CPU-bound applications.
The output of perf looks like this:
Performance counter stats for './cache_line_test 0 1 2 3':
1288.050638 task-clock # 3.930 CPUs utilized
185 context-switches # 0.144 K/sec
8 cpu-migrations # 0.006 K/sec
395 page-faults # 0.307 K/sec
3,182,411,312 cycles # 2.471 GHz [39.95%]
2,720,300,251 stalled-cycles-frontend # 85.48% frontend cycles idle [40.28%]
764,587,902 stalled-cycles-backend # 24.03% backend cycles idle [40.43%]
1,040,828,706 instructions # 0.33 insns per cycle
# 2.61 stalled cycles per insn [51.33%]
130,948,681 branches # 101.664 M/sec [51.48%]
20,721 branch-misses # 0.02% of all branches [50.65%]
652,263,290 L1-dcache-loads # 506.396 M/sec [51.24%]
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
4,846,815 LLC-loads # 3.763 M/sec [40.18%]
301 LLC-load-misses # 0.01% of all LL-cache hits [39.58%]
It outputs your cache miss rate, which will make it easy to tune your program and see the effect.
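For reference, counters like these can be collected with an explicit event list (./your_program is a placeholder; perf stat -d produces a similar default set):
perf stat -e cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./your_program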
I wrote an article about cache line effects and perf; you can read it for more details.

Analyzing cause of performance regression with different kernel version

I have come across a strange performance regression from Linux kernel 3.11 to 3.12 on x86_64 systems.
Running Mark Stock's Radiance benchmark on Fedora 20, 3.12 is noticeably slower. Nothing else is changed - identical binary, identical glibc - I just boot a different kernel version, and the performance changes.
The timed program, rpict, is 100% CPU bound user-level code.
Before I report this as a bug, I'd like to find the cause for this behavior. I don't know a lot about the Linux kernel, and the change log from 3.11 to 3.12 does not give me any clue.
I observed this on two systems, an Intel Haswell (i7-4771) and an AMD Richland (A8-6600K).
On the Haswell system user time went from 895 sec with 3.11 to 962 with 3.12. On the Richland, from 1764 to 1844. These times are repeatable to within a few seconds.
I did some profiling with perf and found that IPC went down in the same proportion as the slowdown. On the Haswell system, this seems to be caused by more missed branches, but why should the prediction rate go down? Radiance does use the random number generator - could "better" randomness cause the missed branches? But apart from OMAP4 support, the RNG does not seem to have changed in 3.12.
On the AMD system, perf just points to more idle backend cycles, but the cause is not clear.
Haswell system:
3.11.10 895s user, 3.74% branch-misses, 1.65 insns per cycle
3.12.6 962s user, 4.22% branch-misses, 1.52 insns per cycle
Richland system:
3.11.10 1764s user, 8.23% branch-misses, 0.75 insns per cycle
3.12.6 1844s user, 8.26% branch-misses, 0.72 insns per cycle
I also looked at a diff from the dmesg output of both kernels, but did not see anything that might have caused such a slowdown of a CPU-bound program.
I tried switching the cpufreq governor from the default ondemand to performance, but that did not have any effect.
The executable was compiled using gcc 4.7.3 but not using AVX instructions. libm still seems to use some AVX (e.g. __ieee754_pow_fma4) but these functions are only 0.3% of total execution time.
Additional info:
Diff of kernel configs
diff of the dmesg outputs on the Haswell system.
diff of /proc/pid/maps - 3.11 maps only one heap region; 3.12 lots.
perf stat output from the A8-6600K system
perf stats with TLB misses - the dTLB stats look very different!
/usr/bin/time -v output from the A8-6600K system
Any ideas (apart from bisecting the kernel changes)?
Let's check your perf stat outputs: http://www.chr-breitkopf.de/tmp/perf-stat.A8.txt
Kernel 3.11.10
1805057.522096 task-clock # 0.999 CPUs utilized
183,822 context-switches # 0.102 K/sec
109 cpu-migrations # 0.000 K/sec
40,451 page-faults # 0.022 K/sec
7,523,630,814,458 cycles # 4.168 GHz [83.31%]
628,027,409,355 stalled-cycles-frontend # 8.35% frontend cycles idle [83.34%]
2,688,621,128,444 stalled-cycles-backend # 35.74% backend cycles idle [33.35%]
5,607,337,995,118 instructions # 0.75 insns per cycle
# 0.48 stalled cycles per insn [50.01%]
825,679,208,404 branches # 457.425 M/sec [66.67%]
67,984,693,354 branch-misses # 8.23% of all branches [83.33%]
1806.804220050 seconds time elapsed
Kernel 3.12.6
1875709.455321 task-clock # 0.999 CPUs utilized
192,425 context-switches # 0.103 K/sec
133 cpu-migrations # 0.000 K/sec
40,356 page-faults # 0.022 K/sec
7,822,017,368,073 cycles # 4.170 GHz [83.31%]
634,535,174,769 stalled-cycles-frontend # 8.11% frontend cycles idle [83.34%]
2,949,638,742,734 stalled-cycles-backend # 37.71% backend cycles idle [33.35%]
5,607,926,276,713 instructions # 0.72 insns per cycle
# 0.53 stalled cycles per insn [50.01%]
825,760,510,232 branches # 440.239 M/sec [66.67%]
68,205,868,246 branch-misses # 8.26% of all branches [83.33%]
1877.263511002 seconds time elapsed
There are almost 300 Gcycles more for 3.12.6 in the "cycles" field; only 6.5 Gcycles of these are additional frontend stalls, while 261 Gcycles are additional backend stalls. You have only 0.2 G additional branch misses (each costs about 20 cycles per the optimization manual, page 597, so roughly 4 Gcycles), so I think your performance problem is related to the memory subsystem (backend stalls are the more realistic signal here, and they can be influenced by the kernel). The page-fault and migration differences are small, and I think they do not slow down the test directly (but migrations may move the program to a worse place).
You should go deeper into the perf counters to find the exact type of problem (it will be easier if you have shorter test runs). Intel's manual http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf will help you. Check page 587 (B.3.2) for the overall event hierarchy (FE and BE stalls are there too), and B.3.2.1-B.3.2.3 for info on backend stalls and how to start digging (checking cache events, etc.), and the sections below that.
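For example, a first pass at the backend stalls could compare cache and TLB events between the two kernels (the benchmark command is a placeholder):
perf stat -e cycles,instructions,cache-references,cache-misses,dTLB-loads,dTLB-load-misses <benchmark command>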
How can the kernel influence the memory subsystem? It can set up a different virtual-to-physical mapping (hardly your case), or it can move the process farther from its data. You have a non-NUMA machine, but Haswell is not exactly UMA either - there is a ring bus, and some cores are closer to the memory controller or to some parts of the shared LLC (last level cache). You can test your program with the taskset utility, binding it to one core - the kernel will then not move it to another core.
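A minimal example of such pinning (the core number is arbitrary):
taskset -c 0 <benchmark command>
or, for an already running process, taskset -cp 0 <pid>.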
UPDATE: After checking your new perf stats from the A8 we see that there are more dTLB misses for 3.12.6. With the changes in /proc/pid/maps (lots of short [heap] sections instead of a single [heap]; still no exact info on why), I think the difference may come from transparent hugepages (THP; with 2M hugepages fewer TLB entries are needed for the same amount of memory, so there are fewer TLB misses). For example, in 3.12 THP may not be applied because of the short heap sections.
You can check /proc/PID/smaps for AnonHugePages and /proc/vmstat for thp* values to see the THP results. Values are documented here: kernel.org/doc/Documentation/vm/transhuge.txt
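Concretely, checks along these lines (the PID is a placeholder) show whether THP is in use:
grep AnonHugePages /proc/<PID>/smaps
grep thp_ /proc/vmstat
cat /sys/kernel/mm/transparent_hugepage/enabled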
#osgx You found the cause! After echo never > /sys/kernel/mm/transparent_hugepage/enabled, 3.11.10 takes as long as 3.12.6!
Good news!
Additional info on how to disable the randomization, and on where to report this as a bug (a 7% performance regression is quite severe), would be appreciated.
I was wrong - this multi-heap-section effect is not caused by the brk randomisation (which changes only the beginning of the heap). It is a failure of VMA merging in do_brk; I don't know why yet, but some changes related to VM_SOFTDIRTY went into mm between 3.11.10 and 3.12.6.
UPDATE2: Possible cause of not-merging VMA:
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L2580 do_brk in 3.11
http://lxr.missinglinkelectronics.com/linux+v3.12/mm/mmap.c#L2577 do_brk in 3.12
3.12 just adds the following at the end of do_brk:
2663 vma->vm_flags |= VM_SOFTDIRTY;
2664 return addr;
And a bit above we have
2635 /* Can we just expand an old private anonymous mapping? */
2636 vma = vma_merge(mm, prev, addr, addr + len, flags,
2637 NULL, NULL, pgoff, NULL);
and inside vma_merge there is a test on vm_flags:
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L994 3.11
http://lxr.missinglinkelectronics.com/linux+v3.12/mm/mmap.c#L994 3.12
1004 /*
1005 * We later require that vma->vm_flags == vm_flags,
1006 * so this tests vma->vm_flags & VM_SPECIAL, too.
1007 */
vma_merge --> can_vma_merge_before --> is_mergeable_vma ...
898 if (vma->vm_flags ^ vm_flags)
899 return 0;
But at the time of the check, the new vma is not yet marked VM_SOFTDIRTY while the old one already is, so the vm_flags differ and the merge fails.
This change could be a likely candidate: http://marc.info/?l=linux-kernel&m=138012715018064. I say this loosely as I don't have the resources to confirm it. It's worth noting that this was the only significant change to the scheduler between 3.11.10 and 3.12.6.
Anyhow I'm very interested to see the end results of your findings so keep us posted.

Application not running at full speed?

I have the following scenario:
machine 1: receives messages from outside and processes them (via a Java application). For processing it relies on a database (on machine 2).
machine 2: an Oracle DB
As performance metrics I usually look at the value of processed messages per time.
Now, what puzzles me: neither of the two machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as if they do not have enough to do.
What I expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So to me, none of these values seem to be at any limit.
PS: for testing of course the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically also need to measure INSIDE the application. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.

Resources