Understanding iostat %utilization - performance

I used the command below to test the maximum throughput the disk can achieve:
dd if=/dev/zero of=test bs=4k count=25000 conv=fdatasync
Over multiple runs it averaged out to about 130 MB/s.
Now, when running Cassandra on this system, I am monitoring disk usage with
iostat -dmxt 30 sdd sdb sdc
There are certain entries I want to make sure I am interpreting correctly, like the one below.
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdc 0.00 2718.60 186.30 27.20 17.87 12.06 287.13 44.98 215.06 2.79 59.58
Shouldn't the sum of rMB/s + wMB/s, taken as a fraction of the disk's measured throughput (130 MB/s), roughly match %util? I assume some of the utilization goes toward seeks, but can that really account for a difference as large as about 24% of utilization?
Thanks in advance for any help.

Frequent seeks do take a significant amount of (latency) time; in my tests, the bandwidth difference between sequential IO and random IO is about 3x. Also, it is better to use fio (https://github.com/axboe/fio) to run this type of test, e.g. direct IO, sequential read/write with a proper block size (256 KB or 512 KB, depending on what the controller supports), libaio as the IO engine, and an IO queue depth of 64. The test will be much better controlled.
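As a rough sketch of such a run (the file name, size, and runtime here are placeholders, not values from the original test):
fio --name=seqwrite --filename=/path/to/testfile --size=5G --rw=write --bs=256k --direct=1 --ioengine=libaio --iodepth=64 --runtime=60 --time_based
Swapping --rw=write for --rw=randwrite gives the random-IO counterpart, which makes the sequential-versus-random gap mentioned above directly measurable.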


Testing sequential disk write performance with fio and iostat

I am trying to make sense of sequential disk write performance on a spinning hard disk. I am using direct and sync IO to bypass the page cache. For a small block size (4 KB), fio reports ~11 IOPS. This means fio is issuing 11 write system calls per second, each 4 KB in size (so total bandwidth = 11 * 4k ≈ 44 kB/s). But when I monitor the disk using iostat, it tells me that the disk is seeing ~60 IOPS (w/s), with an average request size of 4 KB (wareq-sz), for a total bandwidth of 60 * 4k ≈ 240 kB/s (wkB/s). So my questions are the following:
Why is my throughput so low even when doing sequential writes? (The small block size should not really matter, because the disk head should not move around much.)
What is causing the 3x write amplification seen in iostat?
I am enclosing the fio job file as well as the iostat output.
Job file:
[global]
filename=/mnt/500gbhdd/fio_file
runtime=30s
ioengine=sync
time_based
direct=1
sync=1
rw=write
size=5G
wait_for_previous
[4k]
bs=4k
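For reference, a job file like this is typically run and cross-checked roughly as follows (the file name jobs.fio and the device name sdX are placeholders):
fio jobs.fio
iostat -dxm 1 sdX    # run in a second terminal while the job is active
Comparing fio's reported write IOPS and bandwidth with iostat's w/s and wkB/s over the same interval is what exposes the amplification described above.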

Why does `native_write_msr` dominate my profiling result?

I have a program that runs on a multi-threaded framework with Linux kernel 4.18 and an Intel CPU. I ran perf record -p pid -g -e cycles:u --call-graph lbr -F 99 -- sleep 20 to collect stack traces and generate a flame graph.
My program was running under a low workload, so the time spent in futex_wait is expected. But the top of the stack is a kernel function, native_write_msr. According to What does native_write_msr in kernel do? and https://elixir.bootlin.com/linux/v4.18/source/arch/x86/include/asm/msr.h#L103, this function is used for performance counters. I have disabled the tracepoint in native_write_msr.
And pidstat -p pid 1 told me that the system CPU usage is quite low.
05:44:34 PM UID PID %usr %system %guest %CPU CPU Command
05:44:35 PM 1001 67441 60.00 4.00 0.00 64.00 11 my_profram
05:44:36 PM 1001 67441 58.00 7.00 0.00 65.00 11 my_profram
05:44:37 PM 1001 67441 61.00 3.00 0.00 64.00 11 my_profram
My questions are:
Why does native_write_msr appear so many times in the stack traces (as a result, it occupies about 80% of the flame graph)? Is it a blocking operation, or does it release the CPU when called?
Why is the system CPU usage relatively low compared with the flame graph? According to the graph, 80% of the CPU time should belong to %system instead of %usr.
Any help is appreciated. If I missed any useful information, please comment.
Thank you very much!
From the flame graph, you can see that native_write_msr is called by the function schedule. When a running process is removed from a core (because it is migrated to another core or stopped by the scheduler to run another process), the scheduler needs to dump the process's perf data and clean up its perf configuration, so that perf data from different processes does not get mixed up. The scheduler may need to write to an MSR in this step, thus calling native_write_msr. So native_write_msr is called so many times because scheduling or core migration is happening very frequently.
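One way to check that hypothesis (a sketch; replace pid with the profiled process ID) is to count scheduler events directly over the same 20-second window:
perf stat -e context-switches,cpu-migrations -p pid -- sleep 20
pidstat -w -p pid 1    # cswch/s and nvcswch/s, per second
If both report high rates, the heavy native_write_msr samples line up with frequent rescheduling rather than with work done by the program itself.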

Await time is high but %util is less for SSD disk

For the following SSD disk, the await time is high but %util is low.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-3 0.00 0.00 132.00 2272.00 892.00 100244.00 84.14 1111.70 658.36 5.00 696.32 0.20 49.20
%util is ((r/s + w/s) * svctm / 1000) * 100, and util represents the percentage of time the device spent servicing requests. So ~50% util is not high.
On the other hand, the await time is pretty high. The tasks' actual wait time in the queue is "await - svctm", i.e. (658.36 - 0.20) * 100 / 658.36, which is close to 100%. That means tasks are spending most of their time waiting in the queue.
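Plugging the sample row into the %util formula above as a sanity check: (132.00 + 2272.00) * 0.20 / 1000 * 100 ≈ 48%, which is consistent with the reported 49.20%.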
If util is low but the await time is high, is the disk being utilized properly? Which of these two metrics is more reliable for SSDs?
The screenshot shows your write wait time is high. SSD writes generally tend to get slower with use as the drive fills up, and for many other reasons, such as:
Compatibility issues - check compatibility with your hardware
The partition was cloned from an HDD to the SSD
TRIM is not enabled (see the check below)
The drive was not initialized with zeros/ones
The drive sits behind an IO controller, and the controller itself might be slow
Check which of these scenarios fits your case, perform the cleanup/initialization once, and see if it resolves the issue. For compatibility issues, you might need to contact the support team.
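For the TRIM point specifically, a quick check on Linux looks roughly like this (sdX and the mount point are placeholders):
lsblk --discard /dev/sdX    # non-zero DISC-GRAN/DISC-MAX means the device advertises discard (TRIM) support
sudo fstrim -v /mnt/ssd     # trims the free space of the mounted filesystem once and reports how much was trimmed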

Heavy disk io write in select statement in HIVE

In Hive I am running the query:
select ret[0],ret[1],ret[2],ret[3],ret[4],ret[5],ret[6] from (select combined1(extra) as ret from log_test1) a ;
Here ret[0], ret[1], ret[2], ... are domain, date, IP, etc. This query is doing heavy writes to disk.
iostat result on one of the boxes in the cluster:
avg-cpu: %user %nice %system %iowait %steal %idle
20.65 0.00 1.82 57.14 0.00 20.39
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdb 0.00 0.00 0.00 535.00 0.00 23428.00 87.58 143.94 269.11 0.00 269.11 1.87 100.00
My mapper is basically stuck in disk IO. I have a 3-box cluster. My YARN configuration is:
Mapper memory (mapreduce.map.memory.mb) = 2 GB
I/O Sort Memory Buffer = 1 GB
I/O Sort Spill Percent = 0.8
The counters of my job are:
FILE: Number of bytes read 0
FILE: Number of bytes written 2568435
HDFS: Number of bytes read 1359720216
HDFS: Number of bytes written 19057298627
Virtual memory (bytes) snapshot 24351916032
Total committed heap usage (bytes) 728760320
Physical memory (bytes) snapshot 2039455744
Map input records 76076426
Input split bytes 2738
GC time elapsed (ms) 55602
Spilled Records 0
As I understand it, the mapper should initially write everything to RAM, and only when the RAM buffer (I/O Sort Memory Buffer) gets full should it spill data to disk. But as I am seeing, Spilled Records = 0 and the mapper is not even using its full RAM, yet there are still such heavy disk writes.
Even when I run the query
select combined1(extra) from log_test1;
I get the same heavy disk IO writes.
What can be the reason for this heavy disk write, and how can I reduce it? In this case, disk IO is becoming the bottleneck for my mapper.
It may be that your subquery is being written to disk before the second stage of processing takes place. You should use EXPLAIN to examine the execution plan.
You could try rewriting your subquery as a CTE: https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression
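As a sketch of both suggestions, using the query from the question (the CTE name parsed is arbitrary; hive -e simply executes the quoted statement from a shell):
hive -e "EXPLAIN select ret[0],ret[1],ret[2],ret[3],ret[4],ret[5],ret[6] from (select combined1(extra) as ret from log_test1) a;"
hive -e "WITH parsed AS (SELECT combined1(extra) AS ret FROM log_test1) SELECT ret[0], ret[1], ret[2], ret[3], ret[4], ret[5], ret[6] FROM parsed;"
Whether the CTE form actually reduces the intermediate data written depends on the plan EXPLAIN shows, so compare the plans before and after the rewrite.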

Application not running at full speed?

I have the following scenario:
Machine 1: receives messages from outside and processes them (via a Java application). For processing it relies on a database (on machine 2).
Machine 2: an Oracle DB
As a performance metric I usually look at the number of processed messages per unit of time.
Now, what puzzles me: neither of the two machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as though they do not have enough to do.
What I would expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So to me, none of the values seem to be at any limit.
PS: for testing, the message queue is of course always full, so both machines have enough work to do.
To find bottlenecks you typically also need to measure INSIDE the application. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.
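A minimal way to start on the Java side, assuming a HotSpot JVM (pid is the Java process ID; the dump file names are arbitrary):
jstack pid > dump1.txt
sleep 5
jstack pid > dump2.txt
If most threads in both dumps are parked on the same lock or sitting in JDBC calls waiting on the database, the limit is in the application or in round trips to Oracle rather than in raw CPU or disk capacity, which would match the low utilization figures above.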
