Ridiculously slow ZFS - amazon-ec2

Am I misinterpreting iostat results or is it really writing just 3.06 MB per minute?
# zpool iostat -v 60
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 356G 588G 465 72 1.00M 3.11M
xvdf 356G 588G 465 72 1.00M 3.11M
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 356G 588G 568 58 1.26M 3.06M
xvdf 356G 588G 568 58 1.26M 3.06M
---------- ----- ----- ----- ----- ----- -----
Currently rsync is copying files over from another HDD (ext4). Given our file characteristics (~50 KB files), the math seems to check out: 3.06 MB * 1024 / 58 writes ≈ 54 KB per write.
For the record:
primarycache=metadata
compression=lz4
dedup=off
checksum=on
relatime=on
atime=off
The server is on EC2, currently 1 core and 2 GB RAM (t2.small); the HDD is the cheapest one Amazon offers. OS - Debian Jessie, with zfs-dkms installed from the Debian testing repository.
If it's really that slow, then why? Is there a way to improve performance without moving everything to SSD and adding 8 GB of RAM? Can it perform well on a VPS at all, or was ZFS designed with bare metal in mind?
EDIT
I've added a 5 GB general-purpose SSD to be used as the ZIL, as suggested in the answers. That didn't help much, as the ZIL doesn't seem to be used at all. 5 GB should be more than enough in my use case, since according to the Oracle article I should have half the size of the RAM.
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 504G 440G 47 36 272K 2.74M
xvdf 504G 440G 47 36 272K 2.74M
logs - - - - - -
xvdg 0 4.97G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 504G 440G 44 37 236K 2.50M
xvdf 504G 440G 44 37 236K 2.50M
logs - - - - - -
xvdg 0 4.97G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
EDIT
A dd test shows pretty decent speed.
# dd if=/dev/zero of=/mnt/zfs/docstore/10GB_test bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 29.3561 s, 366 MB/s
However, the iostat output hasn't changed much bandwidth-wise. Note the higher number of write operations.
# zpool iostat -v 10
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 529G 415G 0 40 1.05K 2.36M
xvdf 529G 415G 0 40 1.05K 2.36M
logs - - - - - -
xvdg 0 4.97G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 529G 415G 2 364 3.70K 3.96M
xvdf 529G 415G 2 364 3.70K 3.96M
logs - - - - - -
xvdg 0 4.97G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 529G 415G 0 613 0 4.48M
xvdf 529G 415G 0 613 0 4.48M
logs - - - - - -
xvdg 0 4.97G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 529G 415G 0 490 0 3.67M
xvdf 529G 415G 0 490 0 3.67M
logs - - - - - -
xvdg 0 4.97G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 529G 415G 0 126 0 2.77M
xvdf 529G 415G 0 126 0 2.77M
logs - - - - - -
xvdg 0 4.97G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfs-backup 529G 415G 0 29 460 1.84M
xvdf 529G 415G 0 29 460 1.84M
logs - - - - - -
xvdg 0 4.97G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----

Can it perform well on a VPS at all, or was ZFS designed with bare metal in mind?
Yes to both.
Originally it was designed for bare metal, and that is where you naturally get the best performance and the full feature set (otherwise you have to trust the underlying storage, for example that writes are really committed to disk when synchronous writes are requested). It is quite flexible, though, as your vdevs can consist of any files or devices you have available - but of course, performance can only be as good as the underlying storage.
Some points for consideration:
Moving files between different ZFS file systems is always a full copy/remove, not just rearranging of links (does not apply to your case, but may in the future)
Sync writing is much, much slower than async (ZFS has to wait for every single request to be committed and cannot queue the writes in the usual fashion*), and can only be sped up by moving the ZFS intent log to a dedicated vdev suited to high write IOPS, low latency and high endurance (in most cases this will be an SLC SSD or similar, but it could be any device different from the devices already in the pool). A system with normal disks that can easily saturate 110 MB/s async might have sync performance of about 0.5 to 10 MB/s (depending on vdevs) without separating the ZIL onto a dedicated SLOG device. Therefore I would not consider your values out of the ordinary.
Even with good hardware, ZFS will never be as fast as simpler file systems, because of the overhead for flexibility and safety. This was stated by Sun from the beginning and should not surprise you. If you value performance over everything else, choose something else.
Block size of the file system in question can affect performance, but I do not have reliable test numbers at hand.
More RAM will not help you much (over a low threshold of about 1 GB for the system itself), because it is used only as read cache (unless you have deduplication enabled)
Suggestions:
Use faster (virtual) disks for your pool
Separate the ZIL from your normal pool by using a different (virtual) device, preferably one faster than the pool; but even a device of the same speed that is not linked to the other devices improves your case
Use async instead of sync and verify it yourself after your transaction (or after sizeable chunks of it); the sketch after the footnote below gives a rough way to measure the difference
*) To be more precise: in general all small sync writes below a certain size are additionally collected in the ZIL before being written to disk from RAM, which happens either every five seconds or after about 4 GB, whichever comes first (all of those parameters can be modified). This is done because:
writing from RAM to spinning disks every 5 seconds can be done as one continuous stream and is therefore faster than many small writes
in case of sudden power loss, the aborted in-flight transactions are stored safely in the ZIL and can be reapplied upon reboot. This works like a transaction log in a database and guarantees a consistent state of the file system (for old data) and also that no data to be written is lost (for new data).
Normally the ZIL resides on the pool itself, which should be protected by using redundant vdevs, making the whole operation very resilient against power loss, disk crashes, bit errors etc. The downside is that the pool disks need to do the random small writes before they can flush the same data to disk in a more efficient continuous transfer - therefore it is recommended to move the ZIL onto another device - usually called an SLOG device (Separate LOG device). This can be another disk, but an SSD performs much better at this workload (and will wear out pretty fast, as most transactions go through it). If you never experience a crash, your SSD will never be read, only written to.
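To get a feel for how large that sync/async gap is on your own pool, a rough way is to time fsync-per-write against buffered writes followed by a single fsync. Here is a minimal Python sketch (the target path and sizes are just examples; point it at a dataset on your pool):

import os, time

TARGET = "/mnt/zfs/docstore/synctest.bin"   # example path, use a dataset on your pool
CHUNK = b"x" * 50 * 1024                    # ~50 KB, roughly your file size
COUNT = 200

def timed(label, fn):
    t0 = time.time()
    fn()
    dt = time.time() - t0
    mb = len(CHUNK) * COUNT / 1e6
    print("%s: %.2fs (%.2f MB/s)" % (label, dt, mb / dt))

def sync_writes():
    # fsync after every chunk: each write has to reach stable storage (ZIL/SLOG)
    with open(TARGET, "wb") as f:
        for _ in range(COUNT):
            f.write(CHUNK)
            f.flush()
            os.fsync(f.fileno())

def async_writes():
    # let ZFS batch everything into one transaction group, sync once at the end
    with open(TARGET, "wb") as f:
        for _ in range(COUNT):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())

timed("sync  (fsync per write)", sync_writes)
timed("async (single fsync)   ", async_writes)
os.remove(TARGET)

On a pool without a fast SLOG the first number will typically be far lower than the second.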

This particular problem may be due to a noisy neighbor. Since it's a t2 instance, you will end up with the lowest priority. In this case you can stop/start your instance to get a new host.
Unless you are using instance storage (which is not really an option for t2 instances anyway), all disk writing is done to what are essentially SAN volumes. The network interface to the EBS system is shared by all instances on the same host. The size of the instance will determine the priority of the instance.
If you are writing from one volume to another, you are passing all read and write traffic over the same interface.
There may be other factors at play depending on which volume types you use and whether you have any CPU credits left on your t2 instance.
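If you want to check the credit situation rather than guess, the relevant CloudWatch metrics can be pulled with boto3. A small sketch (the region, volume ID and instance ID are placeholders, and BurstBalance only exists for volume types that use burst credits):

import datetime
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region
now = datetime.datetime.utcnow()
start = now - datetime.timedelta(hours=1)

def latest(namespace, metric, dim_name, dim_value):
    # fetch the most recent 5-minute average for one metric/dimension pair
    resp = cw.get_metric_statistics(
        Namespace=namespace, MetricName=metric,
        Dimensions=[{"Name": dim_name, "Value": dim_value}],
        StartTime=start, EndTime=now, Period=300, Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else None

print("EBS BurstBalance    :", latest("AWS/EBS", "BurstBalance", "VolumeId", "vol-0123456789abcdef0"))
print("EC2 CPUCreditBalance:", latest("AWS/EC2", "CPUCreditBalance", "InstanceId", "i-0123456789abcdef0"))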

Related

Why Impala Scan Node is very slow (RowBatchQueueGetWaitTime)?

This query returns in about 10 seconds most of the time, but occasionally it needs 40 seconds or more.
There are two executor nodes in the cluster, and there is no remarkable difference between the profiles of the two nodes; the following is one of them:
HDFS_SCAN_NODE (id=0):(Total: 39s818ms, non-child: 39s818ms, % non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.07
- AverageScannerThreadConcurrency: 1.47
- BytesRead: 563.73 MB (591111366)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0
- CachedFileHandlesHitCount: 0 (0)
- CachedFileHandlesMissCount: 560 (560)
- CollectionItemsRead: 0 (0)
- DecompressionTime: 1s501ms
- MaterializeTupleTime(*): 11s685ms
- MaxCompressedTextFileLength: 0
- NumColumns: 9 (9)
- NumDictFilteredRowGroups: 0 (0)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 56 (56)
- NumScannerThreadMemUnavailable: 0 (0)
- NumScannerThreadReservationsDenied: 0 (0)
- NumScannerThreadsStarted: 4 (4)
- NumScannersWithNoReads: 0 (0)
- NumStatsFilteredRowGroups: 0 (0)
- PeakMemoryUsage: 142.10 MB (149004861)
- PeakScannerThreadConcurrency: 2 (2)
- PerReadThreadRawHdfsThroughput: 151.39 MB/sec
- RemoteScanRanges: 1.68K (1680)
- RowBatchBytesEnqueued: 2.32 GB (2491334455)
- RowBatchQueueGetWaitTime: 39s786ms
- RowBatchQueuePeakMemoryUsage: 1.87 MB (1959936)
- RowBatchQueuePutWaitTime: 0.000ns
- RowBatchesEnqueued: 6.38K (6377)
- RowsRead: 73.99M (73994828)
- RowsReturned: 6.40M (6401849)
- RowsReturnedRate: 161.27 K/sec
- ScanRangesComplete: 56 (56)
- ScannerThreadsInvoluntaryContextSwitches: 99 (99)
- ScannerThreadsTotalWallClockTime: 1m10s
- ScannerThreadsSysTime: 630.808ms
- ScannerThreadsUserTime: 12s824ms
- ScannerThreadsVoluntaryContextSwitches: 1.25K (1248)
- TotalRawHdfsOpenFileTime(*): 9s396ms
- TotalRawHdfsReadTime(*): 3s789ms
- TotalReadThroughput: 11.70 MB/sec
Buffer pool:
- AllocTime: 1.240ms
- CumulativeAllocationBytes: 706.32 MB (740630528)
- CumulativeAllocations: 578 (578)
- PeakReservation: 140.00 MB (146800640)
- PeakUnpinnedBytes: 0
- PeakUsedReservation: 33.83 MB (35471360)
- ReadIoBytes: 0
- ReadIoOps: 0 (0)
- ReadIoWaitTime: 0.000ns
- WriteIoBytes: 0
- WriteIoOps: 0 (0)
- WriteIoWaitTime: 0.000ns
We can notice that RowBatchQueueGetWaitTime is very high, almost 40 seconds, but I cannot figure out why. Even accounting for TotalRawHdfsOpenFileTime taking 9 seconds and TotalRawHdfsReadTime taking almost 4 seconds, I still cannot explain where the other 27 seconds are spent.
Can you suggest the possible issue and how I can solve it?
The threading model in the scan nodes is pretty complex because there are two layers of worker threads for scanning and I/O - I'll call them scanner and I/O threads. I'll go top down and call out some potential bottlenecks and how to identify them.
High RowBatchQueueGetWaitTime indicates that the main thread consuming from the scan is spending a lot of time waiting for the scanner threads to produce rows. One major source of variance can be the number of scanner threads - if the system is under resource pressure each query can get fewer threads. So keep an eye on AverageScannerThreadConcurrency to understand if that is varying.
The scanner threads spend their time doing a variety of things. The bulk of the time generally goes to:
Not running because the operating system scheduled a different thread.
Waiting for I/O threads to read data from the storage system
Decoding data, evaluating predicates, other work
With #1 you would see a higher value for ScannerThreadsInvoluntaryContextSwitches and ScannerThreadsUserTime/ScannerThreadsSysTime much lower than ScannerThreadsTotalWallClockTime. If ScannerThreadsUserTime is much lower than MaterializeTupleTime, that would be another symptom.
With #3 you would see high ScannerThreadsUserTime and MaterializeTupleTime. It looks like there is a significant amount of CPU time going to that here, but not the bulk of the time.
To identify #2, I would recommend looking at TotalStorageWaitTime in the fragment profile to understand how much time threads actually spent waiting for I/O. I also added ScannerIoWaitTime in more recent Impala releases, which is more convenient since it's in the scanner profile.
If the storage wait time is high, there are a few things to consider:
If TotalRawHdfsOpenFileTime is high, it could be that opening the files is a bottleneck. This can happen on any storage system, including HDFS. See Why Impala spend a lot of time Opening HDFS File (TotalRawHdfsOpenFileTime)?
If TotalRawHdfsReadTime is high, reading from the storage system may be slow (e.g. if the data is not in the OS buffer cache or it is a remote filesystem like S3)
Other queries may be contending for I/O resources and/or I/O threads
I suspect in your case that the root cause is both slowness opening files for this query, and slowness opening files for other queries causing scanner threads to be occupied. Likely enabling file handle caching will solve the problem - we've seen dramatic improvements in performance on production deployments by doing that.
Another possibility worth mentioning is that the built-in JVM is doing some garbage collection - this could block some of the HDFS operations. We have some pause detection that logs messages when there is a JVM pause. You can also look at the /memz debug page, which I think has some GC stats. Or connect up other Java debugging tools.
ScannerThreadsVoluntaryContextSwitches: 1.25K (1248) means that there were 1248 situations where scan threads got "stuck" waiting for some external resource and were subsequently put to sleep().
Most likely that resource was disk I/O. That would explain the quite low average read speed (TotalReadThroughput: 11.70 MB/sec) despite the "normal" per-read throughput (PerReadThreadRawHdfsThroughput: 151.39 MB/sec).
EDIT
To increase performance, you may want to try:
enable short circuit reads (dfs.client.read.shortcircuit=true)
configure HDFS caching and alter the Impala table to use the cache
(Note that both are applicable only if you're running Impala against HDFS, not some sort of object store.)

Performance Analysis of Multiple Kernels (CUDA C)

I have a CUDA program with multiple kernels that run in series (in the same stream - the default one). I want to do a performance analysis of the program as a whole, specifically of the GPU portion. I'm doing the analysis using some metrics such as achieved_occupancy, inst_per_warp, gld_efficiency and so on, using the nvprof tool.
But the profiler gives metric values separately for each kernel, while I want to compute them across all kernels to see the total usage of the GPU by the program.
Should I take the average, the largest value, or the total over all kernels for each metric?
One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels occupy 60 milliseconds in our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
kernel   duration   gld_efficiency
1        10ms       88%
2        20ms       76%
3        30ms       50%
You could compute the weighted average as follows:
"overall" global load efficiency = (88*10 + 76*20 + 50*30) / 60 = 65%
I'm sure there are other approaches that make sense as well. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and do your weighting based on that rather than on kernel duration:
kernel   gld_transactions   gld_efficiency
1        1000               88%
2        2000               76%
3        3000               50%
"overall" global load efficiency = (88*1000 + 76*2000 + 50*3000) / 6000 = 65%
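If you have more than a handful of kernels, it is easier to let a few lines of code do the weighting. A minimal Python sketch using the numbers above (either weighting scheme is just a different set of weights):

def weighted_metric(pairs):
    # pairs: (weight, metric_value) per kernel; the weight can be the kernel
    # duration, the global load transaction count, or whatever fits the metric
    total = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / total

# weighted by kernel duration (ms)
print(weighted_metric([(10, 88), (20, 76), (30, 50)]))        # 65.0
# weighted by global load transactions
print(weighted_metric([(1000, 88), (2000, 76), (3000, 50)]))  # 65.0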

Is there a way, using wmic, to reverse engineer which volume maps to which partition(s)? [closed]

Problem: I only have access to wmic (lame, I know), but I need to figure out which volume corresponds to which partition(s), and which partition(s) correspond to which disk. I know how to tell which partition corresponds to which disk because the disk ID is directly in the results of the wmic query. The first part of the problem is more difficult, though: how do I correlate which volume belongs to which partition(s)?
Is there a way, using wmic, to reverse engineer which volume maps to which partition(s)?
If so how would this query look?
wmic logicaldisk get name, volumename
for more info use wmic logicaldisk get /?
The easiest way to do this is with diskpart from a command prompt:
C:\>diskpart
Microsoft DiskPart version 10.0.10586
Copyright (C) 1999-2013 Microsoft Corporation.
On computer: TIMSPC
DISKPART> select disk 0
Disk 0 is now the selected disk.
DISKPART> detail disk
HGST HTS725050A7E630    (Note: this is the model of my hard disk)
Disk ID: 00C942C7
Type : SATA
Status : Online
Path : 0
Target : 0
LUN ID : 0
Location Path : PCIROOT(0)#PCI(1F02)#ATA(C00T00L00)
Current Read-only State : No
Read-only : No
Boot Disk : Yes
Pagefile Disk : Yes
Hibernation File Disk : No
Crashdump Disk : Yes
Clustered Disk : No
Volume ### Ltr Label Fs Type Size Status Info
---------- --- ----------- ----- ---------- ------- --------- --------
Volume 0 System NTFS Partition 350 MB Healthy System
Volume 1 C OSDisk NTFS Partition 464 GB Healthy Boot
Volume 2 NTFS Partition 843 MB Healthy Hidden
DISKPART> exit
Leaving DiskPart...
C:\>
You have access to a command line since you have access to WMIC, so this method should work.
Based on the comments below:
No, there is no way to use WMIC to determine with 100% accuracy exactly which volume corresponds to which partition on a specific drive. The problem with determining this information via WMI is that not all drives are basic drives. Some disks may be dynamic disks containing a RAID volume that spans multiple drives. Some may be a complete hardware-implemented abstraction, like a storage array (for example, a p410i RAID controller in an HP ProLiant). In addition, there are multiple partitioning schemes (e.g. UEFI/GPT vs BIOS/MBR). WMI, however, is independent of its environment. That is, it doesn't care about the hardware. It is simply another form of abstraction that provides a common interface model that unifies and extends existing instrumentation and management standards.
Getting the level of detail you desire will require a tool that can interface at a much lower level, such as the driver for the device, and hope that the driver provides the information you need. If it doesn't, you will be looking at very low-level programming to interface with the device itself... essentially creating a new driver that provides the information you want. Given your limitation of only having command line access, Diskpart is the closest prebuilt tool you will find.
There are volumes which do not have traditional letters.
And? Diskpart can select disk, partitions, and volumes based on the number assigned. The drive letter is irrelevant.
At no point in disk part is any kind of ID listed which allows a user to 100% know which partition they are dealing with when they reference a volume.
Here is an example from one of my servers with two 500 GB hard drives. The first is the Boot/OS drive. The second has 2 GB of unallocated space.
DISKPART> list volume
Volume ### Ltr Label Fs Type Size Status Info
---------- --- ----------- ----- ---------- ------- --------- ------
Volume 0 System NTFS Partition 350 MB Healthy System
Volume 1 C OSDisk NTFS Partition 465 GB Healthy Boot
Volume 2 D New Volume NTFS Partition 463 GB Healthy
DISKPART> select volume 2
Volume 2 is the selected volume.
DISKPART> list disk
Disk ### Status Size Free Dyn Gpt
-------- ------------- ------- ------- --- ---
Disk 0 Online 465 GB 0 B
* Disk 1 Online 465 GB 2049 MB
DISKPART> list partition
Partition ### Type Size Offset
------------- ---------------- ------- -------
* Partition 1 Primary 463 GB 1024 KB
DISKPART> list volume
Volume ### Ltr Label Fs Type Size Status Info
---------- --- ----------- ----- ---------- ------- --------- ------
Volume 0 System NTFS Partition 350 MB Healthy System
Volume 1 C OSDisk NTFS Partition 465 GB Healthy Boot
* Volume 2 D New Volume NTFS Partition 463 GB Healthy
DISKPART>
Notice the asterisks? Those denote the active disk, partition, and volume. While these are not the ID you require to allow a user to 100% know which partition they are dealing with, you can at least clearly see that Volume 2 (D:) is on Partition 1 of Disk 1.
There are volumes that are RAW disks which is essentially saying.. this is a raw disk and I want to find out where these raw disks are at.
As you can see, after I have created a volume with no file system on the 2 GB of free space, this does not make any difference:
DISKPART> list volume
Volume ### Ltr Label Fs Type Size Status Info
---------- --- ----------- ----- ---------- ------- --------- -------
Volume 0 System NTFS Partition 350 MB Healthy System
Volume 1 C OSDisk NTFS Partition 465 GB Healthy Boot
Volume 2 D New Volume NTFS Partition 463 GB Healthy
Volume 3 RAW Partition 2048 MB Healthy
DISKPART> select volume 3
Volume 3 is the selected volume.
DISKPART> list volume
Volume ### Ltr Label Fs Type Size Status Info
---------- --- ----------- ----- ---------- ------- --------- -------
Volume 0 System NTFS Partition 350 MB Healthy System
Volume 1 C OSDisk NTFS Partition 465 GB Healthy Boot
Volume 2 D New Volume NTFS Partition 463 GB Healthy
* Volume 3 RAW Partition 2048 MB Healthy
DISKPART> list partition
Partition ### Type Size Offset
------------- ---------------- ------- -------
Partition 1 Primary 463 GB 1024 KB
* Partition 2 Primary 2048 MB 463 GB
DISKPART> list disk
Disk ### Status Size Free Dyn Gpt
-------- ------------- ------- ------- --- ---
Disk 0 Online 465 GB 0 B
* Disk 1 Online 465 GB 1024 KB
The reason that I am using wmic is because I need to script out many disk ops. Have you ever tried to script out getting information from diskpart?
No, but it is scriptable.
In your sample data, you can enumerate the disks, volumes, and partitions. By looping through each object and selecting it, you can create a map of which volume is on which partition and which disk contains that partition. Diskpart may not provide 100% of the data you need 100% of the time with 100% of the accuracy you want, but it is the closest command line tool you are going to find to meet your goal.
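For example, here is a rough Python sketch of that loop (run it as administrator; diskpart reads its commands from a script file passed with /s, and the parsing is deliberately simplistic, so adjust it to your output):

import os, re, subprocess, tempfile

def run_diskpart(commands):
    # write the commands to a temporary script and feed it to diskpart /s
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(commands) + "\n")
        script = f.name
    try:
        out = subprocess.run(["diskpart", "/s", script],
                             capture_output=True, text=True, check=True)
        return out.stdout
    finally:
        os.remove(script)

# enumerate all volume numbers first
volumes = re.findall(r"Volume\s+(\d+)", run_diskpart(["list volume"]))

for vol in volumes:
    # selecting a volume also puts the containing disk and partition in focus;
    # they show up with an asterisk in the listings
    detail = run_diskpart(["select volume %s" % vol, "list disk", "list partition"])
    disk = re.search(r"\*\s+Disk\s+(\d+)", detail)
    part = re.search(r"\*\s+Partition\s+(\d+)", detail)
    print("Volume %s -> disk %s, partition %s" % (
        vol,
        disk.group(1) if disk else "?",
        part.group(1) if part else "?"))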

Total Cache misses fewer than data cache misses (PAPI_L1_DCM > PAPI_L1_TCM)

For my application (SpMV) I get more data cache misses (PAPI_L1_DCM) than total cache misses (PAPI_L1_TCM) for the level 1 cache. How can that be? For level 2 the values are OK. This is what the PAPI counters report:
[PAPI_L1_ICM ][PAPI_L1_DCM ][PAPI_L1_TCM ][PAPI_L2_ICM ][PAPI_L2_DCM ][PAPI_L2_TCM ]
1256 3388225 1442386 1007 2389903 2390908
Furthermore, I have the case that my cache accesses are below the cache misses for a level. I can't explain that either.
[PAPI_L2_TCA ][PAPI_L2_TCM ][PAPI_L2_DCA ][PAPI_L2_DCM ]
1427361 2367210 1456111 2326503
Maybe the papi_avail output can explain it. It would also be good to know the exact definition of the PAPI counters in terms of Intel events, but I didn't find it in the manual: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
$ papi_avail
Available PAPI preset and user defined events plus hardware information.
--------------------------------------------------------------------------------
PAPI Version : 5.4.1.0
Vendor string and code : GenuineIntel (1)
Model string and code : Intel(R) Xeon(R) CPU E5-2680 v3 # 2.50GHz (63)
CPU Revision : 2.000000
CPUID Info : Family: 6 Model: 63 Stepping: 2
CPU Max Megahertz : 2501
CPU Min Megahertz : 1200
Hdw Threads per core : 2
Cores per Socket : 12
Sockets : 2
NUMA Nodes : 2
CPUs per Node : 24
Total CPUs : 48
Running in a VM : no
Number Hardware Counters : 11
Max Multiplex Counters : 32
--------------------------------------------------------------------------------
================================================================================
PAPI Preset Events
================================================================================
Name Code Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses
PAPI_L1_ICM 0x80000001 Yes No Level 1 instruction cache misses
PAPI_L2_DCM 0x80000002 Yes Yes Level 2 data cache misses
PAPI_L2_ICM 0x80000003 Yes No Level 2 instruction cache misses
PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses
PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses
PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses
PAPI_L2_TCM 0x80000007 Yes No Level 2 cache misses
PAPI_L3_TCM 0x80000008 Yes No Level 3 cache misses
PAPI_CA_SNP 0x80000009 Yes No Requests for a snoop
PAPI_CA_SHR 0x8000000a Yes No Requests for exclusive access to shared cache line
PAPI_CA_CLN 0x8000000b Yes No Requests for exclusive access to clean cache line
PAPI_CA_INV 0x8000000c Yes No Requests for cache line invalidation
PAPI_CA_ITV 0x8000000d Yes No Requests for cache line intervention
PAPI_L3_LDM 0x8000000e Yes No Level 3 load misses
PAPI_L3_STM 0x8000000f No No Level 3 store misses
PAPI_BRU_IDL 0x80000010 No No Cycles branch units are idle
PAPI_FXU_IDL 0x80000011 No No Cycles integer units are idle
PAPI_FPU_IDL 0x80000012 No No Cycles floating point units are idle
PAPI_LSU_IDL 0x80000013 No No Cycles load/store units are idle
PAPI_TLB_DM 0x80000014 Yes Yes Data translation lookaside buffer misses
PAPI_TLB_IM 0x80000015 Yes No Instruction translation lookaside buffer misses
PAPI_TLB_TL 0x80000016 No No Total translation lookaside buffer misses
PAPI_L1_LDM 0x80000017 Yes No Level 1 load misses
PAPI_L1_STM 0x80000018 Yes No Level 1 store misses
PAPI_L2_LDM 0x80000019 Yes No Level 2 load misses
PAPI_L2_STM 0x8000001a Yes No Level 2 store misses
PAPI_BTAC_M 0x8000001b No No Branch target address cache misses
PAPI_PRF_DM 0x8000001c Yes No Data prefetch cache misses
PAPI_L3_DCH 0x8000001d No No Level 3 data cache hits
PAPI_TLB_SD 0x8000001e No No Translation lookaside buffer shootdowns
PAPI_CSR_FAL 0x8000001f No No Failed store conditional instructions
PAPI_CSR_SUC 0x80000020 No No Successful store conditional instructions
PAPI_CSR_TOT 0x80000021 No No Total store conditional instructions
PAPI_MEM_SCY 0x80000022 No No Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY 0x80000023 No No Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY 0x80000024 Yes No Cycles Stalled Waiting for memory writes
PAPI_STL_ICY 0x80000025 Yes No Cycles with no instruction issue
PAPI_FUL_ICY 0x80000026 Yes Yes Cycles with maximum instruction issue
PAPI_STL_CCY 0x80000027 Yes No Cycles with no instructions completed
PAPI_FUL_CCY 0x80000028 Yes No Cycles with maximum instructions completed
PAPI_HW_INT 0x80000029 No No Hardware interrupts
PAPI_BR_UCN 0x8000002a Yes Yes Unconditional branch instructions
PAPI_BR_CN 0x8000002b Yes No Conditional branch instructions
PAPI_BR_TKN 0x8000002c Yes Yes Conditional branch instructions taken
PAPI_BR_NTK 0x8000002d Yes No Conditional branch instructions not taken
PAPI_BR_MSP 0x8000002e Yes No Conditional branch instructions mispredicted
PAPI_BR_PRC 0x8000002f Yes Yes Conditional branch instructions correctly predicted
PAPI_FMA_INS 0x80000030 No No FMA instructions completed
PAPI_TOT_IIS 0x80000031 No No Instructions issued
PAPI_TOT_INS 0x80000032 Yes No Instructions completed
PAPI_INT_INS 0x80000033 No No Integer instructions
PAPI_FP_INS 0x80000034 No No Floating point instructions
PAPI_LD_INS 0x80000035 Yes No Load instructions
PAPI_SR_INS 0x80000036 Yes No Store instructions
PAPI_BR_INS 0x80000037 Yes No Branch instructions
PAPI_VEC_INS 0x80000038 No No Vector/SIMD instructions (could include integer)
PAPI_RES_STL 0x80000039 Yes No Cycles stalled on any resource
PAPI_FP_STAL 0x8000003a No No Cycles the FP unit(s) are stalled
PAPI_TOT_CYC 0x8000003b Yes No Total cycles
PAPI_LST_INS 0x8000003c Yes Yes Load/store instructions completed
PAPI_SYC_INS 0x8000003d No No Synchronization instructions completed
PAPI_L1_DCH 0x8000003e No No Level 1 data cache hits
PAPI_L2_DCH 0x8000003f No No Level 2 data cache hits
PAPI_L1_DCA 0x80000040 No No Level 1 data cache accesses
PAPI_L2_DCA 0x80000041 Yes No Level 2 data cache accesses
PAPI_L3_DCA 0x80000042 Yes Yes Level 3 data cache accesses
PAPI_L1_DCR 0x80000043 No No Level 1 data cache reads
PAPI_L2_DCR 0x80000044 Yes No Level 2 data cache reads
PAPI_L3_DCR 0x80000045 Yes No Level 3 data cache reads
PAPI_L1_DCW 0x80000046 No No Level 1 data cache writes
PAPI_L2_DCW 0x80000047 Yes No Level 2 data cache writes
PAPI_L3_DCW 0x80000048 Yes No Level 3 data cache writes
PAPI_L1_ICH 0x80000049 No No Level 1 instruction cache hits
PAPI_L2_ICH 0x8000004a Yes No Level 2 instruction cache hits
PAPI_L3_ICH 0x8000004b No No Level 3 instruction cache hits
PAPI_L1_ICA 0x8000004c No No Level 1 instruction cache accesses
PAPI_L2_ICA 0x8000004d Yes No Level 2 instruction cache accesses
PAPI_L3_ICA 0x8000004e Yes No Level 3 instruction cache accesses
PAPI_L1_ICR 0x8000004f No No Level 1 instruction cache reads
PAPI_L2_ICR 0x80000050 Yes No Level 2 instruction cache reads
PAPI_L3_ICR 0x80000051 Yes No Level 3 instruction cache reads
PAPI_L1_ICW 0x80000052 No No Level 1 instruction cache writes
PAPI_L2_ICW 0x80000053 No No Level 2 instruction cache writes
PAPI_L3_ICW 0x80000054 No No Level 3 instruction cache writes
PAPI_L1_TCH 0x80000055 No No Level 1 total cache hits
PAPI_L2_TCH 0x80000056 No No Level 2 total cache hits
PAPI_L3_TCH 0x80000057 No No Level 3 total cache hits
PAPI_L1_TCA 0x80000058 No No Level 1 total cache accesses
PAPI_L2_TCA 0x80000059 Yes Yes Level 2 total cache accesses
PAPI_L3_TCA 0x8000005a Yes No Level 3 total cache accesses
PAPI_L1_TCR 0x8000005b No No Level 1 total cache reads
PAPI_L2_TCR 0x8000005c Yes Yes Level 2 total cache reads
PAPI_L3_TCR 0x8000005d Yes Yes Level 3 total cache reads
PAPI_L1_TCW 0x8000005e No No Level 1 total cache writes
PAPI_L2_TCW 0x8000005f Yes No Level 2 total cache writes
PAPI_L3_TCW 0x80000060 Yes No Level 3 total cache writes
PAPI_FML_INS 0x80000061 No No Floating point multiply instructions
PAPI_FAD_INS 0x80000062 No No Floating point add instructions
PAPI_FDV_INS 0x80000063 No No Floating point divide instructions
PAPI_FSQ_INS 0x80000064 No No Floating point square root instructions
PAPI_FNV_INS 0x80000065 No No Floating point inverse instructions
PAPI_FP_OPS 0x80000066 No No Floating point operations
PAPI_SP_OPS 0x80000067 No No Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS 0x80000068 No No Floating point operations; optimized to count scaled double precision vector operations
PAPI_VEC_SP 0x80000069 No No Single precision vector/SIMD instructions
PAPI_VEC_DP 0x8000006a No No Double precision vector/SIMD instructions
PAPI_REF_CYC 0x8000006b Yes No Reference clock cycles
================================================================================
User Defined Events
================================================================================
Name Code Avail Deriv Description (Note)
--------------------------------------------------------------------------------
Of 108 possible events, 56 are available, of which 12 are derived.
avail.c PASSED
I found an explanation for the first problem (PAPI_L1_DCM > PAPI_L1_TCM):
The referenced native event is L1D:REPLACEMENT, which "counts the number of lines brought into the L1 data cache" (source).
$ papi_avail -e PAPI_L1_DCM
Available PAPI preset and user defined events plus hardware information.
--------------------------------------------------------------------------------
PAPI Version : 5.4.1.0
Vendor string and code : GenuineIntel (1)
Model string and code : Intel(R) Xeon(R) CPU E5-2680 v3 # 2.50GHz (63)
CPU Revision : 2.000000
CPUID Info : Family: 6 Model: 63 Stepping: 2
CPU Max Megahertz : 2501
CPU Min Megahertz : 1200
Hdw Threads per core : 2
Cores per Socket : 12
Sockets : 2
NUMA Nodes : 2
CPUs per Node : 24
Total CPUs : 48
Running in a VM : no
Number Hardware Counters : 11
Max Multiplex Counters : 32
--------------------------------------------------------------------------------
Event name: PAPI_L1_DCM
Event Code: 0x80000000
Number of Native Events: 1
Short Description: |L1D cache misses|
Long Description: |Level 1 data cache misses|
Developer's Notes: ||
Derived Type: |NOT_DERIVED|
Postfix Processing String: ||
Native Code[0]: 0x40000006 |L1D:REPLACEMENT|
Number of Register Values: 0
Native Event Description: |L1D cache, masks:L1D Data line replacements|
--------------------------------------------------------------------------------
avail.c PASSED
I can't explain the second issue (PAPI_L2_TCA < PAPI_L2_TCM). It could have to do with speculative fetches into the L2 cache. The native events are L2_RQSTS:ALL_DEMAND_REFERENCES (PAPI_L2_TCA) and LLC_REFERENCES (PAPI_L2_TCM).

Get current disk load

Since I can't use watch with iostat -dx 1 to get the current disk load, I'd like to know whether there is an alternative way to do this, e.g. by doing calculations with the values contained in /proc/diskstats and/or some other files.
According to kernel.org, the mapping is :
The /proc/diskstats file displays the I/O statistics
of block devices. Each line contains the following 14
fields:
1 - major number
2 - minor number
3 - device name
4 - reads completed successfully
5 - reads merged
6 - sectors read
7 - time spent reading (ms)
8 - writes completed
9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)
For more details refer to Documentation/iostats.txt
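A minimal Python sketch that uses those fields to derive per-device throughput and utilization over a short interval (the device name is just an example):

import time

DEV = "sda"        # example device; pick the one you care about
INTERVAL = 1.0     # seconds between the two samples
SECTOR = 512       # sector counts in /proc/diskstats are always in 512-byte units

def snapshot(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return {
                    "sectors_read":    int(fields[5]),   # field 6:  sectors read
                    "sectors_written": int(fields[9]),   # field 10: sectors written
                    "io_ms":           int(fields[12]),  # field 13: time spent doing I/Os
                }
    raise ValueError("%s not found in /proc/diskstats" % dev)

a = snapshot(DEV)
time.sleep(INTERVAL)
b = snapshot(DEV)

read_mbs  = (b["sectors_read"]    - a["sectors_read"])    * SECTOR / 1e6 / INTERVAL
write_mbs = (b["sectors_written"] - a["sectors_written"]) * SECTOR / 1e6 / INTERVAL
util_pct  = (b["io_ms"] - a["io_ms"]) / (INTERVAL * 1000) * 100

print("%s: read %.2f MB/s, write %.2f MB/s, util %.1f%%" % (DEV, read_mbs, write_mbs, util_pct))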
You can use or read Sys::Statistics::Linux::DiskStats too
