Get current disk load - bash

Since I can't use watch on iostat -dx 1 to get the current disk load, I'd like to know if there is an alternative way to do this, e.g., doing calculations with the values contained in /proc/diskstats and/or some other files.

According to kernel.org, the mapping is:
The /proc/diskstats file displays the I/O statistics
of block devices. Each line contains the following 14
fields:
1 - major number
2 - minor number
3 - device name
4 - reads completed successfully
5 - reads merged
6 - sectors read
7 - time spent reading (ms)
8 - writes completed
9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)
For more details refer to Documentation/iostats.txt
You can also use or read Sys::Statistics::Linux::DiskStats.
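A minimal sketch of one way to do the calculation (Python here for readability; the device name "sda" and the 1-second interval are placeholders): field 13, time spent doing I/Os, is a cumulative counter in milliseconds, so sampling it twice and dividing the delta by the wall-clock interval approximates the %util figure that iostat -dx reports.

import time

def read_busy_ms(device):
    # Field 13 (time spent doing I/Os, ms) is index 12 after splitting a line.
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])
    raise ValueError("device not found: " + device)

interval = 1.0  # seconds, like "iostat -dx 1"
before = read_busy_ms("sda")
time.sleep(interval)
after = read_busy_ms("sda")

# Delta of the busy-time counter over the sampling window -> utilization %
print("disk utilization: %.1f%%" % (100.0 * (after - before) / (interval * 1000.0)))

The same delta arithmetic applied to fields 4, 6, 8 and 10 gives read/write IOPS and sector throughput for the interval.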

Related

Why Impala Scan Node is very slow (RowBatchQueueGetWaitTime)?

This query returns in 10 seconds most of the time, but occasionally it needs 40 seconds or more.
There are two executor nodes in the cluster, and there is no remarkable difference between the profiles of the two nodes; the following is one of them:
HDFS_SCAN_NODE (id=0):(Total: 39s818ms, non-child: 39s818ms, % non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.07
- AverageScannerThreadConcurrency: 1.47
- BytesRead: 563.73 MB (591111366)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0
- CachedFileHandlesHitCount: 0 (0)
- CachedFileHandlesMissCount: 560 (560)
- CollectionItemsRead: 0 (0)
- DecompressionTime: 1s501ms
- MaterializeTupleTime(*): 11s685ms
- MaxCompressedTextFileLength: 0
- NumColumns: 9 (9)
- NumDictFilteredRowGroups: 0 (0)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 56 (56)
- NumScannerThreadMemUnavailable: 0 (0)
- NumScannerThreadReservationsDenied: 0 (0)
- NumScannerThreadsStarted: 4 (4)
- NumScannersWithNoReads: 0 (0)
- NumStatsFilteredRowGroups: 0 (0)
- PeakMemoryUsage: 142.10 MB (149004861)
- PeakScannerThreadConcurrency: 2 (2)
- PerReadThreadRawHdfsThroughput: 151.39 MB/sec
- RemoteScanRanges: 1.68K (1680)
- RowBatchBytesEnqueued: 2.32 GB (2491334455)
- RowBatchQueueGetWaitTime: 39s786ms
- RowBatchQueuePeakMemoryUsage: 1.87 MB (1959936)
- RowBatchQueuePutWaitTime: 0.000ns
- RowBatchesEnqueued: 6.38K (6377)
- RowsRead: 73.99M (73994828)
- RowsReturned: 6.40M (6401849)
- RowsReturnedRate: 161.27 K/sec
- ScanRangesComplete: 56 (56)
- ScannerThreadsInvoluntaryContextSwitches: 99 (99)
- ScannerThreadsTotalWallClockTime: 1m10s
- ScannerThreadsSysTime: 630.808ms
- ScannerThreadsUserTime: 12s824ms
- ScannerThreadsVoluntaryContextSwitches: 1.25K (1248)
- TotalRawHdfsOpenFileTime(*): 9s396ms
- TotalRawHdfsReadTime(*): 3s789ms
- TotalReadThroughput: 11.70 MB/sec
Buffer pool:
- AllocTime: 1.240ms
- CumulativeAllocationBytes: 706.32 MB (740630528)
- CumulativeAllocations: 578 (578)
- PeakReservation: 140.00 MB (146800640)
- PeakUnpinnedBytes: 0
- PeakUsedReservation: 33.83 MB (35471360)
- ReadIoBytes: 0
- ReadIoOps: 0 (0)
- ReadIoWaitTime: 0.000ns
- WriteIoBytes: 0
- WriteIoOps: 0 (0)
- WriteIoWaitTime: 0.000ns
We can see that RowBatchQueueGetWaitTime is very high, almost 40 seconds, but I cannot figure out why. Even granting that TotalRawHdfsOpenFileTime takes 9 seconds and TotalRawHdfsReadTime takes almost 4 seconds, I still cannot explain where the other 27 seconds are spent.
Can you suggest the possible cause and how I can solve it?
The threading model in the scan nodes is pretty complex because there are two layers of worker threads for scanning and I/O - I'll call them scanner and I/O threads. I'll go top down and call out some potential bottlenecks and how to identify them.
High RowBatchQueueGetWaitTime indicates that the main thread consuming from the scan is spending a lot of time waiting for the scanner threads to produce rows. One major source of variance can be the number of scanner threads - if the system is under resource pressure each query can get fewer threads. So keep an eye on AverageScannerThreadConcurrency to understand if that is varying.
The scanner threads spend their time doing a variety of things. The bulk of the time generally goes to:
1. Not running because the operating system scheduled a different thread.
2. Waiting for the I/O threads to read data from the storage system.
3. Decoding data, evaluating predicates, and other work.
With #1 you would see a higher value for ScannerThreadsInvoluntaryContextSwitches and ScannerThreadsUserTime/ScannerThreadsSysTime much lower than ScannerThreadsTotalWallClockTime. If ScannerThreadsUserTime is much lower than MaterializeTupleTime, that would be another symptom.
With #3 you would see high ScannerThreadsUserTime and MaterializeTupleTime. It looks like a significant amount of CPU time is going to that here, but it is not the bulk of the time.
To identify #2, I would recommend looking at TotalStorageWaitTime in the fragment profile to understand how much time threads actually spent waiting for I/O. I also added ScannerIoWaitTime in more recent Impala releases, which is more convenient since it is in the scanner profile.
If the storage wait time is high, there are a few things to consider:
If TotalRawHdfsOpenFileTime is high, it could be that opening the files is a bottleneck. This can happen on any storage system, including HDFS. See Why Impala spend a lot of time Opening HDFS File (TotalRawHdfsOpenFileTime)?
If TotalRawHdfsReadTime is high, reading from the storage system may be slow (e.g. if the data is not in the OS buffer cache or it is a remote filesystem like S3)
Other queries may be contending for I/O resources and/or I/O threads
I suspect that in your case the root cause is both slowness opening files for this query and slowness opening files for other queries, causing scanner threads to be occupied. Enabling file handle caching will likely solve the problem - we've seen dramatic improvements in performance on production deployments by doing that.
Another possibility worth mentioning is that the built-in JVM is doing some garbage collection - this could block some of the HDFS operations. We have some pause detection that logs messages when there is a JVM pause. You can also look at the /memz debug page, which I think has some GC stats. Or connect up other Java debugging tools.
ScannerThreadsVoluntaryContextSwitches: 1.25K (1248) means that there were 1248 situations where scan threads got "stuck" waiting for some external resource and were subsequently put to sleep.
Most likely that resource was disk I/O. That would explain the quite low average read speed (TotalReadThroughput: 11.70 MB/sec) despite a "normal" per-read throughput (PerReadThreadRawHdfsThroughput: 151.39 MB/sec).
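As a rough back-of-envelope check, using only numbers already present in the profile above, dividing the aggregate read throughput by the per-thread throughput estimates how many I/O threads were actually reading at any given moment, and the result lines up with the reported AverageHdfsReadThreadConcurrency of 0.07:

# Values taken from the profile above
total_read_throughput = 11.70    # MB/sec (TotalReadThroughput)
per_thread_throughput = 151.39   # MB/sec (PerReadThreadRawHdfsThroughput)

# Effective number of concurrently reading I/O threads
print(total_read_throughput / per_thread_throughput)  # ~0.08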
EDIT
To increase performance, you may want to try:
enable short circuit reads (dfs.client.read.shortcircuit=true)
configure HDFS caching and alter Impala table to use cache
(Note that both are applicable only if you're running Impala against HDFS, not some sort of object store.)

Performance Analysis of Multiple Kernels (CUDA C)

I have a CUDA program with multiple kernels run in series (in the same stream, the default one). I want to do a performance analysis of the program as a whole, specifically of the GPU portion. I'm doing the analysis using metrics such as achieved_occupancy, inst_per_warp, gld_efficiency and so on, with the nvprof tool.
But the profiler gives metric values separately for each kernel, while I want to compute them over all kernels to see the total usage of the GPU by the program.
Should I take the average, the largest value, or the total over all kernels for each metric?
One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels occupy 60 milliseconds in our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
kernel duration gld_efficiency
1 10ms 88%
2 20ms 76%
3 30ms 50%
You could compute the weighted average as follows:
"overall" global load efficiency = (88*10 + 76*20 + 50*30) / 60 = 65%
I'm sure there are other approaches that make sense as well. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and do your weighting based on that, rather than kernel duration:
kernel gld_transactions gld_efficiency
1 1000 88%
2 2000 76%
3 3000 50%
"overall" global load efficiency = (88*1000 + 76*2000 + 50*3000) / 6000 = 65%
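For completeness, a small sketch of both weightings (Python purely for illustration; the durations, transaction counts and efficiencies are the hypothetical values from the tables above):

# Duration-weighted and transaction-weighted averages of a per-kernel metric
durations    = [10, 20, 30]        # ms per kernel
transactions = [1000, 2000, 3000]  # gld_transactions per kernel
efficiency   = [88.0, 76.0, 50.0]  # gld_efficiency per kernel, in %

def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(weighted_average(efficiency, durations))     # 65.0
print(weighted_average(efficiency, transactions))  # 65.0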

File read time in C increases unexpectedly

I'm currently facing an annoying problem: I have to read a big data file (500 GB) which is stored on an SSD (RevoDrive 350).
I read the file using the fread function in big memory chunks (roughly 17 MB per chunk).
At the beginning of my program everything goes smoothly: it takes 10 ms to read 3 chunks. Then, after about 10 seconds, read performance collapses and varies between 60 and 90 ms.
I don't know why this is happening, or whether it is possible to keep the read time stable.
Thank you in advance
Rob
17 MB per chunk, 10 ms for 3 chunks -> 51 MB / 10 ms.
10 sec = 1000 x 10 ms -> 51 GB read after 10 seconds!
How much memory do you have? Is your pagefile on the same disk?
The system may be swapping memory!

How is CPU time measured on Windows?

I am currently creating a program which identifies processes that are hung or out of control and are using an entire CPU core. The program then terminates them, so the CPU usage can be kept under control.
However, I have run into a problem: When I execute the 'tasklist' command on Windows, it outputs this:
Image Name: Blockland.exe
PID: 4880
Session Name: Console
Session#: 6
Mem Usage: 127,544 K
Status: Running
User Name: [removed]\[removed]
CPU Time: 0:00:22
Window Title: C:\HammerHost\Blockland\Blockland.exe
So I know that the line which says "CPU Time" is an indication of the total time, in seconds, used by the program ever since it started.
But let's suppose there are 4 CPU cores on the system. Does this mean that it used up 22 seconds of one core, and therefore used 5.5 seconds on the entire CPU in total? Or does this mean that the process used up 22 seconds on the entire CPU?
It's the total CPU time across all cores. So if the task used 10 seconds on one core and then another 15 seconds on a different core, it would report 25 seconds. If it used 5 seconds on all four cores simultaneously, it would report 20 seconds.
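For the original goal of spotting a process that is pinning an entire core, here is a hedged sketch of one approach, assuming the third-party psutil package is available: sample the process's total CPU time twice and divide the delta by the elapsed wall-clock time. A ratio near 1.0 means the process is keeping roughly one core fully busy, regardless of how many cores the machine has.

import time
import psutil

def core_usage(pid, interval=5.0):
    # Ratio of CPU time consumed to wall-clock time elapsed;
    # ~1.0 means the process is saturating about one core.
    p = psutil.Process(pid)
    t0 = p.cpu_times()
    time.sleep(interval)
    t1 = p.cpu_times()
    return ((t1.user + t1.system) - (t0.user + t0.system)) / interval

pid = 4880  # the PID from the tasklist output above
if core_usage(pid) >= 0.95:
    print("process %d appears to be using a full core" % pid)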

How to tell if a program has been successfully parallelized?

A program run on a parallel machine is measured to have the following efficiency values for increasing numbers of processors, P.
P 1 2 3 4 5 6 7
E 100 90 85 80 70 60 50
Using the above results, plot the speedup graph.
Use the graph to explain whether or not the program has been successfully parallelized.
P E Speedup
1 100% 1
2 90% 1.8
3 85% 2.55
4 80% 3.2
5 70% 3.5
6 60% 3.6
7 50% 3.5
This is a past exam question, and I know how to calculate the speedup and plot the graph. However, I don't know how to tell whether a program has been successfully parallelized.
Amdahl's law
I think the idea here is that not every portion of a program can be parallelized.
For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized while the remaining 19 hours (95%) can be, then regardless of how many processors we devote to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20×.
In this example, the speedup peaks at about 3.6 with 6 processors. Treating that as the Amdahl limit 1/(1 - p), the parallel portion p is roughly 1 - 1/3.6 ≈ 72.2%.
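A short sketch of the same arithmetic (Python for illustration): speedup is recovered from efficiency as S = P x E, and treating the largest observed speedup as the asymptotic Amdahl limit 1/(1 - p) yields the parallel-fraction estimate above.

# Speedup from efficiency, plus a rough Amdahl's-law estimate of the parallel fraction
processors = [1, 2, 3, 4, 5, 6, 7]
efficiency = [1.00, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50]

speedup = [p * e for p, e in zip(processors, efficiency)]
print(speedup)  # ~[1.0, 1.8, 2.55, 3.2, 3.5, 3.6, 3.5]

# Treat the best observed speedup as the Amdahl limit 1 / (1 - p)
s_max = max(speedup)   # 3.6
print(1 - 1 / s_max)   # ~0.722, i.e. about 72% parallel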
