Impala scan MapR-FS slow - hadoop

I recently installed Impala on a 3-node MapR cluster. When I run a simple query, the performance is not as good as Impala on HDFS. Here is the query:
SELECT *
FROM ft_test, ft_wafer
WHERE ft_test_parquet.id = ft_wafer_parquet.id
and month = 1
and day = 8
and param = 2913;
It takes about 3 s, but the same query against HDFS takes less than 1 s for a table of about 30 GB.
Here is the query profile:
Query Runtime Profile:
Query (id=dc4c084615fbf9bb:4261466f00000000):
Summary:
Session ID: 5d4edbf63653cdf6:1a59ff5354c9d4bd
Session Type: BEESWAX
Start Time: 2017-05-25 16:31:25.121391000
End Time: 2017-05-25 16:31:28.584404000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 2.7.0 RELEASE (build a535b583202c4a81080098a10f952d377af1949d)
User: root
Connected User: root
Delegated User:
Network Address: ::ffff:127.0.0.1:58546
Default Db: inspex
Sql Statement: select *
FROM ft_test_partition, ft_wafer_parquet
WHERE ft_test_partition.file = ft_wafer_parquet.file
and month = 1
and day = 8
and param = 2913 limit 100
Coordinator: mapr1:22000
Query Options (non default):
Plan:
----------------
Estimated Per-Host Requirements: Memory=704.67MB VCores=2
04:EXCHANGE [UNPARTITIONED]
| limit: 100
| hosts=1 per-host-mem=unavailable
| tuple-ids=1,0 row-size=800B cardinality=1
|
02:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: ft_wafer_parquet.file = ft_test_partition.file
| runtime filters: RF000 <- ft_test_partition.file
| limit: 100
| hosts=1 per-host-mem=690.00KB
| tuple-ids=1,0 row-size=800B cardinality=1
|
|--03:EXCHANGE [BROADCAST]
| | hosts=1 per-host-mem=0B
| | tuple-ids=0 row-size=78B cardinality=8235
| |
| 00:SCAN HDFS [inspex.ft_test_partition, RANDOM]
| partitions=1/29 files=1 size=171.69MB
| predicates: param = 2913
| table stats: 813365826 rows total
| column stats: all
| hosts=1 per-host-mem=704.00MB
| tuple-ids=0 row-size=78B cardinality=8235
|
01:SCAN HDFS [inspex.ft_wafer_parquet, RANDOM]
partitions=1/1 files=1 size=66.83KB
runtime filters: RF000 -> ft_wafer_parquet.file
table stats: 1500 rows total
column stats: all
hosts=1 per-host-mem=192.00MB
tuple-ids=1 row-size=722B cardinality=1500
----------------
Estimated Per-Host Mem: 738904067
Estimated Per-Host VCores: 2
Request Pool: default-pool
Admission result: Admitted immediately
ExecSummary:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
----------------------------------------------------------------------------------------------------------------------
04:EXCHANGE 1 0.000ns 0.000ns 100 1 0 -1.00 B UNPARTITIONED
02:HASH JOIN 1 42.999ms 42.999ms 0 1 3.29 MB 690.00 KB INNER JOIN, BROADCAST
|--03:EXCHANGE 1 999.990us 999.990us 9.15K 8.23K 0 0 BROADCAST
| 00:SCAN HDFS 1 2s973ms 2s973ms 9.15K 8.23K 106.05 MB 704.00 MB inspex.ft_test_partition
01:SCAN HDFS 1 16.999ms 16.999ms 1.02K 1.50K 1.78 MB 192.00 MB inspex.ft_wafer_parquet
Planner Timeline: 15.315ms
- Analysis finished: 5.081ms (5.081ms)
- Equivalence classes computed: 5.601ms (519.374us)
- Single node plan created: 9.054ms (3.453ms)
- Runtime filters computed: 9.409ms (354.377us)
- Distributed plan created: 11.507ms (2.098ms)
- Planning finished: 15.315ms (3.808ms)
Query Timeline: 3s463ms
- Start execution: 0.000ns (0.000ns)
- Planning finished: 17.999ms (17.999ms)
- Submit for admission: 17.999ms (0.000ns)
- Completed admission: 17.999ms (0.000ns)
- Ready to start 2 remote fragments: 18.999ms (999.990us)
- All 2 remote fragments started: 19.999ms (999.990us)
- Rows available: 3s246ms (3s226ms)
- First row fetched: 3s346ms (99.999ms)
- Unregister query: 3s462ms (115.998ms)
- ComputeScanRangeAssignmentTimer: 0.000ns
ImpalaServer:
- ClientFetchWaitTimer: 214.997ms
- RowMaterializationTimer: 999.990us
Execution Profile dc4c084615fbf9bb:4261466f00000000:(Total: 3s228ms, non-child: 0.000ns, % non-child: 0.00%)
Number of filters: 1
Filter routing table:
ID Src. Node Tgt. Node(s) Targets Target type Partition filter Pending (Expected) First arrived Completed Enabled
----------------------------------------------------------------------------------------------------------------------------
0 2 1 1 LOCAL false 0 (1) N/A N/A true
Fragment start latencies: Count: 2, 25th %-ile: 0, 50th %-ile: 0, 75th %-ile: 1ms, 90th %-ile: 1ms, 95th %-ile: 1ms, 99.9th %-ile: 1ms
Final filter table:
ID Src. Node Tgt. Node(s) Targets Target type Partition filter Pending (Expected) First arrived Completed Enabled
----------------------------------------------------------------------------------------------------------------------------
0 2 1 1 LOCAL false 0 (1) N/A N/A true
Per Node Peak Memory Usage: mapr1:22000(108.65 MB)
- FiltersReceived: 0 (0)
- FinalizationTimer: 0.000ns
Coordinator Fragment F02:(Total: 3s226ms, non-child: 0.000ns, % non-child: 0.00%)
MemoryUsage(500.000ms): 16.00 KB, 16.00 KB, 16.00 KB, 16.00 KB, 16.00 KB, 16.00 KB, 16.00 KB
- AverageThreadTokens: 0.00
- BloomFilterBytes: 0
- PeakMemoryUsage: 209.83 KB (214864)
- PerHostPeakMemUsage: 0
- PrepareTime: 0.000ns
- RowsProduced: 0 (0)
- TotalCpuTime: 101.999ms
- TotalNetworkReceiveTime: 3s226ms
- TotalNetworkSendTime: 0.000ns
- TotalStorageWaitTime: 0.000ns
BlockMgr:
- BlockWritesOutstanding: 0 (0)
- BlocksCreated: 48 (48)
- BlocksRecycled: 0 (0)
- BufferedPins: 0 (0)
- BytesWritten: 0
- MaxBlockSize: 8.00 MB (8388608)
- MemoryLimit: 12.21 GB (13111148544)
- PeakMemoryUsage: 256.00 KB (262144)
- TotalBufferWaitTime: 0.000ns
- TotalEncryptionTime: 0.000ns
- TotalIntegrityCheckTime: 0.000ns
- TotalReadBlockTime: 0.000ns
EXCHANGE_NODE (id=4):(Total: 3s226ms, non-child: 0.000ns, % non-child: 0.00%)
BytesReceived(500.000ms): 0, 0, 0, 0, 0, 0, 0
- BytesReceived: 61.05 KB (62513)
- ConvertRowBatchTime: 0.000ns
- DeserializeRowBatchTimer: 0.000ns
- FirstBatchArrivalWaitTime: 3s226ms
- PeakMemoryUsage: 0
- RowsReturned: 100 (100)
- RowsReturnedRate: 30.00 /sec
- SendersBlockedTimer: 0.000ns
- SendersBlockedTotalTimer(*): 0.000ns
Averaged Fragment F00:(Total: 3s001ms, non-child: 0.000ns, % non-child: 0.00%)
split sizes: min: 66.83 KB, max: 66.83 KB, avg: 66.83 KB, stddev: 0
completion times: min:3s227ms max:3s227ms mean: 3s227ms stddev:0.000ns
execution rates: min:20.70 KB/sec max:20.70 KB/sec mean:20.70 KB/sec stddev:0.00 /sec
num instances: 1
- AverageThreadTokens: 1.86
- BloomFilterBytes: 1.00 MB (1048576)
- PeakMemoryUsage: 5.07 MB (5320864)
- PerHostPeakMemUsage: 108.65 MB (113924736)
- PrepareTime: 38.999ms
- RowsProduced: 1.02K (1024)
- TotalCpuTime: 3s232ms
- TotalNetworkReceiveTime: 2s940ms
- TotalNetworkSendTime: 0.000ns
- TotalStorageWaitTime: 13.999ms
CodeGen:(Total: 262.997ms, non-child: 262.997ms, % non-child: 100.00%)
- CodegenTime: 999.990us
- CompileTime: 73.999ms
- LoadTime: 0.000ns
- ModuleBitcodeSize: 1.86 MB (1953028)
- NumFunctions: 85 (85)
- NumInstructions: 2.86K (2857)
- OptimizationTime: 151.998ms
- PrepareTime: 36.999ms
DataStreamSender (dst_id=4):(Total: 999.990us, non-child: 999.990us, % non-child: 100.00%)
- BytesSent: 61.05 KB (62513)
- NetworkThroughput(*): 0.00 /sec
- OverallThroughput: 59.62 MB/sec
- RowsReturned: 1.02K (1024)
- SerializeBatchTime: 999.990us
- TransmitDataRPCTime: 0.000ns
- UncompressedRowBatchSize: 185.83 KB (190290)
HASH_JOIN_NODE (id=2):(Total: 3s001ms, non-child: 42.999ms, % non-child: 1.43%)
- BuildPartitionTime: 1.999ms
- BuildRows: 9.15K (9153)
- BuildRowsPartitioned: 9.15K (9153)
- BuildTime: 0.000ns
- GetNewBlockTime: 0.000ns
- HashBuckets: 16.38K (16384)
- HashCollisions: 0 (0)
- LargestPartitionPercent: 6 (6)
- MaxPartitionLevel: 0 (0)
- NumRepartitions: 0 (0)
- PartitionsCreated: 16 (16)
- PeakMemoryUsage: 3.29 MB (3445888)
- PinTime: 0.000ns
- ProbeRows: 1.02K (1024)
- ProbeRowsPartitioned: 0 (0)
- ProbeTime: 0.000ns
- RowsReturned: 0 (0)
- RowsReturnedRate: 0
- SpilledPartitions: 0 (0)
- UnpinTime: 0.000ns
EXCHANGE_NODE (id=3):(Total: 2s941ms, non-child: 2s941ms, % non-child: 100.00%)
- BytesReceived: 314.85 KB (322407)
- ConvertRowBatchTime: 0.000ns
- DeserializeRowBatchTimer: 0.000ns
- FirstBatchArrivalWaitTime: 0.000ns
- PeakMemoryUsage: 0
- RowsReturned: 9.15K (9153)
- RowsReturnedRate: 3.11 K/sec
- SendersBlockedTimer: 0.000ns
- SendersBlockedTotalTimer(*): 0.000ns
HDFS_SCAN_NODE (id=1):(Total: 16.999ms, non-child: 16.999ms, % non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 0.00
- BytesRead: 128.51 KB (131593)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0
- DecompressionTime: 0.000ns
- MaxCompressedTextFileLength: 0
- NumColumns: 49 (49)
- NumDisksAccessed: 0 (0)
- NumRowGroups: 1 (1)
- NumScannerThreadsStarted: 1 (1)
- PeakMemoryUsage: 1.78 MB (1866400)
- PerReadThreadRawHdfsThroughput: 4.18 MB/sec
- RemoteScanRanges: 0 (0)
- RowsRead: 1.50K (1500)
- RowsReturned: 1.02K (1024)
- RowsReturnedRate: 60.23 K/sec
- ScanRangesComplete: 1 (1)
- ScannerThreadsInvoluntaryContextSwitches: 0 (0)
- ScannerThreadsTotalWallClockTime: 14.999ms
- MaterializeTupleTime(*): 999.990us
- ScannerThreadsSysTime: 0.000ns
- ScannerThreadsUserTime: 2.216ms
- ScannerThreadsVoluntaryContextSwitches: 27 (27)
- TotalRawHdfsReadTime(*): 29.999ms
- TotalReadThroughput: 0.00 /sec
Filter 0 (1.00 MB):
- Rows processed: 1.50K (1500)
- Rows rejected: 3 (3)
- Rows total: 1.50K (1500)
Averaged Fragment F01:(Total: 3s191ms, non-child: 160.998ms, % non-child: 5.04%)
split sizes: min: 171.69 MB, max: 171.69 MB, avg: 171.69 MB, stddev: 0
completion times: min:3s210ms max:3s210ms mean: 3s210ms stddev:0.000ns
execution rates: min:53.47 MB/sec max:53.47 MB/sec mean:53.47 MB/sec stddev:0.00 /sec
num instances: 1
- AverageThreadTokens: 1.86
- BloomFilterBytes: 0
- PeakMemoryUsage: 106.05 MB (111206232)
- PerHostPeakMemUsage: 108.65 MB (113924736)
- PrepareTime: 33.999ms
- RowsProduced: 9.15K (9153)
- TotalCpuTime: 6s330ms
- TotalNetworkReceiveTime: 0.000ns
- TotalNetworkSendTime: 0.000ns
- TotalStorageWaitTime: 36.999ms
CodeGen:(Total: 51.999ms, non-child: 51.999ms, % non-child: 100.00%)
- CodegenTime: 999.990us
- CompileTime: 5.999ms
- LoadTime: 0.000ns
- ModuleBitcodeSize: 1.86 MB (1953028)
- NumFunctions: 13 (13)
- NumInstructions: 228 (228)
- OptimizationTime: 11.999ms
- PrepareTime: 33.999ms
DataStreamSender (dst_id=3):(Total: 4.999ms, non-child: 4.999ms, % non-child: 100.00%)
- BytesSent: 314.85 KB (322407)
- NetworkThroughput(*): 153.74 MB/sec
- OverallThroughput: 61.49 MB/sec
- RowsReturned: 9.15K (9153)
- SerializeBatchTime: 2.999ms
- TransmitDataRPCTime: 1.999ms
- UncompressedRowBatchSize: 769.68 KB (788150)
HDFS_SCAN_NODE (id=0):(Total: 2s973ms, non-child: 2s973ms, % non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 1.00
- BytesRead: 171.79 MB (180132958)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0
- DecompressionTime: 130.998ms
- MaxCompressedTextFileLength: 0
- NumColumns: 8 (8)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 1 (1)
- NumScannerThreadsStarted: 1 (1)
- PeakMemoryUsage: 106.05 MB (111196408)
- PerReadThreadRawHdfsThroughput: 434.91 MB/sec
- RemoteScanRanges: 0 (0)
- RowsRead: 28.05M (28047320)
- RowsReturned: 9.15K (9153)
- RowsReturnedRate: 3.08 K/sec
- ScanRangesComplete: 1 (1)
- ScannerThreadsInvoluntaryContextSwitches: 112 (112)
- ScannerThreadsTotalWallClockTime: 3s157ms
- MaterializeTupleTime(*): 2s977ms
- ScannerThreadsSysTime: 566.243ms
- ScannerThreadsUserTime: 2s525ms
- ScannerThreadsVoluntaryContextSwitches: 100 (100)
- TotalRawHdfsReadTime(*): 394.996ms
- TotalReadThroughput: 57.11 MB/sec
Fragment F00:
Instance dc4c084615fbf9bb:4261466f00000001 (host=mapr1:22000):(Total: 3s001ms, non-child: 0.000ns, % non-child: 0.00%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): -1:1/66.83 KB
Filter 0 arrival: 3s210ms
MemoryUsage(500.000ms): 0, 3.05 MB, 3.14 MB, 3.13 MB, 3.09 MB, 3.12 MB, 3.19 MB
ThreadUsage(500.000ms): 1, 2, 2, 2, 2, 2, 2
- AverageThreadTokens: 1.86
- BloomFilterBytes: 1.00 MB (1048576)
- PeakMemoryUsage: 5.07 MB (5320864)
- PerHostPeakMemUsage: 108.65 MB (113924736)
- PrepareTime: 38.999ms
- RowsProduced: 1.02K (1024)
- TotalCpuTime: 3s232ms
- TotalNetworkReceiveTime: 2s940ms
- TotalNetworkSendTime: 0.000ns
- TotalStorageWaitTime: 13.999ms
CodeGen:(Total: 262.997ms, non-child: 262.997ms, % non-child: 100.00%)
- CodegenTime: 999.990us
- CompileTime: 73.999ms
- LoadTime: 0.000ns
- ModuleBitcodeSize: 1.86 MB (1953028)
- NumFunctions: 85 (85)
- NumInstructions: 2.86K (2857)
- OptimizationTime: 151.998ms
- PrepareTime: 36.999ms
DataStreamSender (dst_id=4):(Total: 999.990us, non-child: 999.990us, % non-child: 100.00%)
- BytesSent: 61.05 KB (62513)
- NetworkThroughput(*): 0.00 /sec
- OverallThroughput: 59.62 MB/sec
- RowsReturned: 1.02K (1024)
- SerializeBatchTime: 999.990us
- TransmitDataRPCTime: 0.000ns
- UncompressedRowBatchSize: 185.83 KB (190290)
HASH_JOIN_NODE (id=2):(Total: 3s001ms, non-child: 42.999ms, % non-child: 1.43%)
ExecOption: Build Side Codegen Enabled, Probe Side Codegen Enabled, Hash Table Construction Codegen Enabled, Join Build-Side Prepared Asynchronously, 1 of 1 Runtime Filter Published
- BuildPartitionTime: 1.999ms
- BuildRows: 9.15K (9153)
- BuildRowsPartitioned: 9.15K (9153)
- BuildTime: 0.000ns
- GetNewBlockTime: 0.000ns
- HashBuckets: 16.38K (16384)
- HashCollisions: 0 (0)
- LargestPartitionPercent: 6 (6)
- MaxPartitionLevel: 0 (0)
- NumRepartitions: 0 (0)
- PartitionsCreated: 16 (16)
- PeakMemoryUsage: 3.29 MB (3445888)
- PinTime: 0.000ns
- ProbeRows: 1.02K (1024)
- ProbeRowsPartitioned: 0 (0)
- ProbeTime: 0.000ns
- RowsReturned: 0 (0)
- RowsReturnedRate: 0
- SpilledPartitions: 0 (0)
- UnpinTime: 0.000ns
EXCHANGE_NODE (id=3):(Total: 2s941ms, non-child: 999.990us, % non-child: 0.03%)
BytesReceived(500.000ms): 70.08 KB, 127.46 KB, 162.66 KB, 230.08 KB, 301.42 KB, 312.19 KB
- BytesReceived: 314.85 KB (322407)
- ConvertRowBatchTime: 0.000ns
- DeserializeRowBatchTimer: 0.000ns
- FirstBatchArrivalWaitTime: 0.000ns
- PeakMemoryUsage: 0
- RowsReturned: 9.15K (9153)
- RowsReturnedRate: 3.11 K/sec
- SendersBlockedTimer: 0.000ns
- SendersBlockedTotalTimer(*): 0.000ns
HDFS_SCAN_NODE (id=1):(Total: 16.999ms, non-child: 16.999ms, % non-child: 100.00%)
ExecOption: Expr Evaluation Codegen Disabled, PARQUET Codegen Enabled
Hdfs split stats (<volume id>:<# splits>/<split lengths>): -1:1/66.83 KB
Runtime filters: All filters arrived. Waited 0
BytesRead(500.000ms): 0, 0, 0, 0, 0, 0
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 0.00
- BytesRead: 128.51 KB (131593)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0
- DecompressionTime: 0.000ns
- MaxCompressedTextFileLength: 0
- NumColumns: 49 (49)
- NumDisksAccessed: 0 (0)
- NumRowGroups: 1 (1)
- NumScannerThreadsStarted: 1 (1)
- PeakMemoryUsage: 1.78 MB (1866400)
- PerReadThreadRawHdfsThroughput: 4.18 MB/sec
- RemoteScanRanges: 0 (0)
- RowsRead: 1.50K (1500)
- RowsReturned: 1.02K (1024)
- RowsReturnedRate: 60.23 K/sec
- ScanRangesComplete: 1 (1)
- ScannerThreadsInvoluntaryContextSwitches: 0 (0)
- ScannerThreadsTotalWallClockTime: 14.999ms
- MaterializeTupleTime(*): 999.990us
- ScannerThreadsSysTime: 0.000ns
- ScannerThreadsUserTime: 2.216ms
- ScannerThreadsVoluntaryContextSwitches: 27 (27)
- TotalRawHdfsReadTime(*): 29.999ms
- TotalReadThroughput: 0.00 /sec
Filter 0 (1.00 MB):
- Rows processed: 1.50K (1500)
- Rows rejected: 3 (3)
- Rows total: 1.50K (1500)
Fragment F01:
Instance dc4c084615fbf9bb:4261466f00000002 (host=mapr1:22000):(Total: 3s191ms, non-child: 160.998ms, % non-child: 5.04%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): -1:1/171.69 MB
MemoryUsage(500.000ms): 0, 91.50 MB, 82.91 MB, 86.63 MB, 67.68 MB, 66.67 MB, 53.51 MB
ThreadUsage(500.000ms): 1, 2, 2, 2, 2, 2, 2
- AverageThreadTokens: 1.86
- BloomFilterBytes: 0
- PeakMemoryUsage: 106.05 MB (111206232)
- PerHostPeakMemUsage: 108.65 MB (113924736)
- PrepareTime: 33.999ms
- RowsProduced: 9.15K (9153)
- TotalCpuTime: 6s330ms
- TotalNetworkReceiveTime: 0.000ns
- TotalNetworkSendTime: 0.000ns
- TotalStorageWaitTime: 36.999ms
CodeGen:(Total: 51.999ms, non-child: 51.999ms, % non-child: 100.00%)
- CodegenTime: 999.990us
- CompileTime: 5.999ms
- LoadTime: 0.000ns
- ModuleBitcodeSize: 1.86 MB (1953028)
- NumFunctions: 13 (13)
- NumInstructions: 228 (228)
- OptimizationTime: 11.999ms
- PrepareTime: 33.999ms
DataStreamSender (dst_id=3):(Total: 4.999ms, non-child: 4.999ms, % non-child: 100.00%)
- BytesSent: 314.85 KB (322407)
- NetworkThroughput(*): 153.74 MB/sec
- OverallThroughput: 61.49 MB/sec
- RowsReturned: 9.15K (9153)
- SerializeBatchTime: 2.999ms
- TransmitDataRPCTime: 1.999ms
- UncompressedRowBatchSize: 769.68 KB (788150)
HDFS_SCAN_NODE (id=0):(Total: 2s973ms, non-child: 2s973ms, % non-child: 100.00%)
ExecOption: Expr Evaluation Codegen Enabled, PARQUET Codegen Enabled, Codegen enabled: 1 out of 1
Hdfs split stats (<volume id>:<# splits>/<split lengths>): -1:1/171.69 MB
Hdfs Read Thread Concurrency Bucket: 0:100% 1:0% 2:0% 3:0% 4:0% 5:0%
File Formats: PARQUET/SNAPPY:8
BytesRead(500.000ms): 88.45 MB, 112.45 MB, 136.45 MB, 152.66 MB, 168.66 MB, 171.79 MB
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 1.00
- BytesRead: 171.79 MB (180132958)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0
- DecompressionTime: 130.998ms
- MaxCompressedTextFileLength: 0
- NumColumns: 8 (8)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 1 (1)
- NumScannerThreadsStarted: 1 (1)
- PeakMemoryUsage: 106.05 MB (111196408)
- PerReadThreadRawHdfsThroughput: 434.91 MB/sec
- RemoteScanRanges: 0 (0)
- RowsRead: 28.05M (28047320)
- RowsReturned: 9.15K (9153)
- RowsReturnedRate: 3.08 K/sec
- ScanRangesComplete: 1 (1)
- ScannerThreadsInvoluntaryContextSwitches: 112 (112)
- ScannerThreadsTotalWallClockTime: 3s157ms
- MaterializeTupleTime(*): 2s977ms
- ScannerThreadsSysTime: 566.243ms
- ScannerThreadsUserTime: 2s525ms
- ScannerThreadsVoluntaryContextSwitches: 100 (100)
- TotalRawHdfsReadTime(*): 394.996ms
- TotalReadThroughput: 57.11 MB/sec
I have already tried Parquet, partitioning, and COMPUTE STATS, but I still can't match the earlier times.
From what I see, most of the time is spent in the HDFS scan, which is odd because that is usually not the time-consuming part. Please take a look; any input would be appreciated. Thanks.

This could be because the HDFS scan on node 0 is taking most of the time:
HDFS_SCAN_NODE (id=0):(Total: 2s973ms, non-child: 2s973ms, % non-child: 100.00%)
File Formats: PARQUET/SNAPPY:8
The same question has been asked on the MapR Converge Community:
https://community.mapr.com/message/59777-impala-scan-mapr-fs-slow

Related

Extracting memory and swap info from /proc/meminfo in Golang

I'd like to extract the values for MemTotal, MemFree, MemAvailable, SwapTotal and SwapFree from /proc/meminfo in Golang. The closest I've gotten so far is to use fmt.Sscanf(), which gives me the values I want one at a time, but I'm also getting many lines of zeros in the output. Here's the code I'm using:
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    f, e := os.Open("/proc/meminfo")
    if e != nil {
        panic(e)
    }
    defer f.Close()
    s := bufio.NewScanner(f)
    for s.Scan() {
        var n int
        fmt.Sscanf(s.Text(), "MemFree: %d kB", &n)
        fmt.Println(n)
    }
}
Which gives me the following results:
0
11260616
0
(… and so on: a 0 for each remaining line of /proc/meminfo that does not match the format string)
So the first question is: is there a way to limit the results to the one (non-zero) value I'm after? Or is there a better way to approach this altogether?
My /proc/meminfo file looks like this:
MemTotal: 16314336 kB
MemFree: 11268004 kB
MemAvailable: 13955820 kB
Buffers: 330284 kB
Cached: 2536848 kB
SwapCached: 0 kB
Active: 1259348 kB
Inactive: 3183140 kB
Active(anon): 4272 kB
Inactive(anon): 1578028 kB
Active(file): 1255076 kB
Inactive(file): 1605112 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4194304 kB
SwapFree: 4194304 kB
Dirty: 96 kB
Writeback: 0 kB
AnonPages: 1411704 kB
Mapped: 594408 kB
Shmem: 6940 kB
KReclaimable: 151936 kB
Slab: 253384 kB
SReclaimable: 151936 kB
SUnreclaim: 101448 kB
KernelStack: 17184 kB
PageTables: 25060 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 12351472 kB
Committed_AS: 6092984 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 40828 kB
VmallocChunk: 0 kB
Percpu: 5696 kB
AnonHugePages: 720896 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 230400 kB
DirectMap2M: 11235328 kB
DirectMap1G: 14680064 kB
Note that s.Scan() reads the input line by line. If a line does not match the format string given to fmt.Sscanf, your program still prints 0, because var n int is declared inside the loop. My suggestion is to check the first result returned by fmt.Sscanf, i.e. the number of items matched: if it is 1, you have a match and can output the value. See a working example here: https://go.dev/play/p/RtBKusGg8wV
EDIT: I tried to stay as close as possible to your code. There may be further issues, as the unit of measurement can vary according to the man pages; it should be good enough for your use case, however, if the values in question on your systems are always reported in kB.
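A minimal sketch of that check, staying close to the question's code (only MemFree is matched here, as in the original; the matched variable name is mine):
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    f, err := os.Open("/proc/meminfo")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    s := bufio.NewScanner(f)
    for s.Scan() {
        var n int
        // Sscanf reports how many items it matched; only a line that really
        // matches the format string yields a count of 1.
        if matched, _ := fmt.Sscanf(s.Text(), "MemFree: %d kB", &n); matched == 1 {
            fmt.Println(n)
        }
    }
}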
I'd like to extract the values for MemTotal, MemFree, MemAvailable, SwapTotal and SwapFree from /proc/meminfo in Golang.
When I look at the values you provided from /proc/meminfo I think of a map: key/value pairs using items from the first column as keys and items from the second column as values.
To keep it simple, you could use map[string]string initially to collect, then convert where needed later to a specific type.
From there, you could use the comma ok idiom to check whether values are available for the specific data you'd like to retrieve.
If you didn't care about the specific values and just wanted anything non-zero, you could filter the key/value pairs before putting them in the map: assert that they're not zero. I'd recommend explicitly trimming spaces before any comparisons you make.
EDIT: Note that I ran into issues using other approaches and eventually switched to a bufio.Scanner to process the file I was working with (also in the /proc filesystem).
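A minimal sketch of that map-based approach, keeping string values until a later conversion (the key list at the end is just the set named in the question):
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("/proc/meminfo")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Collect every "Key: value" line into a map, trimming spaces as suggested above.
    info := make(map[string]string)
    s := bufio.NewScanner(f)
    for s.Scan() {
        parts := strings.SplitN(s.Text(), ":", 2)
        if len(parts) != 2 {
            continue
        }
        info[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
    }

    // Comma-ok idiom: only report entries that were actually present.
    for _, k := range []string{"MemTotal", "MemFree", "MemAvailable", "SwapTotal", "SwapFree"} {
        if v, ok := info[k]; ok {
            fmt.Println(k, "=", v)
        }
    }
}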

Rocksdb concurrent Put performance

I would like to share some experimental results about RocksDB Put performance: two-threaded Put throughput turns out to be lower than single-threaded Put throughput. That is weird, because RocksDB uses the default skiplist as the memtable, and that data structure supports concurrent writes.
Here is my testing code.
uint64_t nthread = 2;
uint64_t nkeys = 16000000;
std::thread threads[nthread];
std::atomic<uint64_t> idx(1000000);
for (int t = 0; t < nthread; t++) {
    threads[t] = std::thread([db, &idx, nthread, nkeys, &write_option_disable] {
        WriteBatch batch;
        for (int i = 0; i < nkeys / nthread; i++) {
            std::string key = "WVERIFY" + std::to_string(idx.fetch_add(1));
            std::string value = "MOCK";
            auto ikey = rocksdb::Slice(key);
            auto ivalue = rocksdb::Slice(value);
            db->Put(write_option_disable, ikey, ivalue);
        }
        return 0;
    });
}
for (auto& t : threads) {
    t.join();
}
Besides, here are the results I got.
// Single thread
Uptime(secs): 8.4 total, 8.3 interval
Flush(GB): cumulative 1.170, interval 1.170
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.17 GB write, 143.35 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Interval compaction: 1.17 GB write, 144.11 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache#0x564742515ea0#7011 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 2e-05 secs_since: 8
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 8.4 total, 8.3 interval
Cumulative writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1.63 GB, 199.80 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1669.88 MB, 200.85 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
// 2 threads
Uptime(secs): 31.4 total, 31.4 interval
Flush(GB): cumulative 0.183, interval 0.183
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.67 GB write, 21.84 MB/s write, 0.97 GB read, 31.68 MB/s read, 10.2 seconds
Interval compaction: 0.67 GB write, 21.87 MB/s write, 0.97 GB read, 31.72 MB/s read, 10.2 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache#0x5619fb7bbea0#6183 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 1.9e-05 secs_since: 31
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 31.4 total, 31.4 interval
Cumulative writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 0.45 GB, 14.67 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 460.94 MB, 14.69 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
===========================update==========================
These are my RocksDB settings.
DB* db;
Options options;
BlockBasedTableOptions table_options;
rocksdb::WriteOptions write_option_disable;
write_option_disable.disableWAL = true;
// Optimize RocksDB. This is the easiest way to get RocksDB to perform well
options.IncreaseParallelism();
options.OptimizeLevelStyleCompaction();
// create the DB if it's not already present
options.create_if_missing = true;
The atomic idx shared between the two threads can introduce non-trivial overhead. Try inserting random values from each thread, and maybe increase the number of threads.
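As one possible sketch of that suggestion (not code from the question): each thread gets its own random key generator, so no std::atomic counter is shared between writers. The put_benchmark wrapper, the per-thread seed, and reuse of the question's db handle and write options are assumptions for illustration.
#include <cstdint>
#include <random>
#include <string>
#include <thread>
#include <vector>

#include "rocksdb/db.h"

// Hypothetical helper: same workload shape as in the question, but each thread
// draws its keys from its own PRNG instead of a shared std::atomic counter.
void put_benchmark(rocksdb::DB* db, const rocksdb::WriteOptions& write_options,
                   uint64_t nthread, uint64_t nkeys) {
    std::vector<std::thread> threads;
    for (uint64_t t = 0; t < nthread; t++) {
        threads.emplace_back([=] {
            std::mt19937_64 rng(t);  // per-thread generator, seeded per thread
            std::uniform_int_distribution<uint64_t> dist;
            for (uint64_t i = 0; i < nkeys / nthread; i++) {
                std::string key = "WVERIFY" + std::to_string(dist(rng));
                db->Put(write_options, rocksdb::Slice(key), rocksdb::Slice("MOCK"));
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
}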

TOP command overall CPU

I have the top output below from my RHEL 6 server, which is running PostgreSQL.
I see 35.8% idle in the Cpu(s) line, while the per-process CPU usage values below all show close to 100%.
So how should I read this output?
top - 03:06:30 up 97 days, 20:15, 3 users, load average: 10.85, 10.51, 10.13
Tasks: 738 total, 14 running, 724 sleeping, 0 stopped, 0 zombie
Cpu(s): 53.3%us, 9.6%sy, 0.0%ni, 35.8%id, 0.6%wa, 0.0%hi, 0.7%si, 0.0%st
Mem: 32077620k total, 24335372k used, 7742248k free, 19084k buffers
Swap: 81919992k total, 407968k used, 81512024k free, 18686780k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19171 enterpri 20 0 8590m 966m 951m R 100.0 3.1 6:24.51 edb-postgres
19588 enterpri 20 0 8590m 956m 941m R 100.0 3.1 1:20.51 edb-postgres
18494 enterpri 20 0 8590m 959m 944m R 99.8 3.1 18:18.75 edb-postgres
18683 enterpri 20 0 8588m 984m 975m R 99.8 3.1 6:22.80 edb-postgres
19158 enterpri 20 0 8592m 1.0g 1.0g R 99.8 3.3 5:40.16 edb-postgres
19167 enterpri 20 0 8589m 959m 945m R 99.8 3.1 7:48.53 edb-postgres
19590 enterpri 20 0 8586m 945m 933m R 99.8 3.0 2:51.32 edb-postgres
19591 enterpri 20 0 8588m 950m 936m R 99.8 3.0 3:07.77 edb-postgres
19592 enterpri 20 0 8589m 948m 935m R 99.8 3.0 2:52.66 edb-postgres
You have a lot of CPUs (how many?) on your system. Some of them are very busy running postgres, and some of them are not.
In your version of top, %CPU represents the percent of a single CPU, not the percent of the total system CPU. If you had a threaded application, one entry could show more than 100%, but PostgreSQL is not threaded within a single process.
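As a rough sanity check, assuming for illustration that the box has 16 logical CPUs (the output above does not show the count): the Cpu(s) line reports 53.3%us + 9.6%sy ≈ 63% busy, and 0.63 × 16 ≈ 10 fully busy CPUs. That is consistent with roughly ten edb-postgres processes each near 100% of one CPU and with the load average of about 10.8, while still leaving 35.8% of the machine idle.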

Counting swaps for sorting statistics - what with swaps with only two assignments instead of three

While helping out a student with his classes, I implemented the dual-pivot quicksort algorithm to prepare a session and got intrigued. After running some statistics, then solving the worst-case situation, then running the stats again and solving the next worst case, and repeating this process several times, the resulting code is no more than 80 lines of simple, straightforward Python (a bit less than Vladimir's code). The novel part is how the 3 partitions are constructed, in combination with some very simple yet effective post-processing of them. Now I need some help on how to test and gather statistics properly.
In particular, how should I count the swaps? Most of the swaps only perform two assignments instead of three. So must I count them as full swaps, or is it fair to count them only as a '2/3' swap?
Counting every swap as 1, the Cn in Cn * N * log2(N) is around 0.48 on short lists (<100 elements) and around 0.55 on longer lists of several million elements. That is just the theoretical minimum as calculated by Vladimir Yaroslavskiy.
Counting the lighter swaps as 2/3 instead, the number of needed swaps is almost equal for any list size and is around 0.36 (stddev around 0.015).
The Cn for the number of comparisons is on average around 1.3 for lists of 2 million records, which is less than the theoretical 1.38 (from 2*N*ln(N)), and lower for shorter lists, e.g. around 1.21 for 1024 elements.
That is for lists with 100% unique numbers and randomly ordered with Python's random.shuffle().
So my question is:
Is it ok to count the lighter swaps as such, and is the result indeed promising or not?
Also interesting is:
the more equal elements there are in the list, the faster it sorts. Cn is 0.03 and 0.1 for swaps and comparisons respectively for a 2-million-element list of all equal elements.
Cn for sorted and reversed sorted lists are almost the same for all sizes: 0.3 and 1 for the swaps (counted with 2/3) and comparisons respectively.
I will post a list with more statistics shortly which includes maximum stack depth, number of recursive calls besides the swaps and comparisons. Are there other things I should count?
Also, are there some 'standard' test suites with files covering all kinds of situations (with equal elements, partially sorted, etc.) that one can use to test a sorting algorithm and make the results comparable with other sorting algorithms?
Added May 5:
I improved the algorithm especially for sorted lists.
Here are the results for 20 runs of each.
Are these good results?
New statistics:
Random.shuffle(), unique number
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.367 0.922 0.250
64 0.360 1.072 0.500
256 0.342 1.122 0.625
1024 0.358 1.156 0.800
4096 0.359 1.199 0.917
16384 0.359 1.244 1.071
65536 0.360 1.244 1.125
262144 0.360 1.269 1.167
1048576 0.362 1.275 1.200
Sorted, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.172 0.531 0.250
64 0.117 0.586 0.333
256 0.087 0.609 0.375
1024 0.075 0.740 0.500
4096 0.060 0.732 0.500
16384 0.051 0.726 0.500
65536 0.044 0.722 0.500
262144 0.041 0.781 0.556
1048576 0.036 0.774 0.550
2097152 0.035 0.780 0.571
Reversed order, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.344 0.828 0.250
64 0.279 0.812 0.333
256 0.234 0.788 0.375
1024 0.210 0.858 0.500
4096 0.190 0.865 0.500
16384 0.172 0.855 0.500
65536 0.158 0.846 0.500
262144 0.153 0.900 0.556
1048576 0.143 0.892 0.550
2097152 0.140 0.895 0.571
I have chosen to count the assignments executed on the elements to be sorted, instead of 'swaps'. Assignments and comparisons of indexes are not counted.
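Purely as an illustration of that counting choice (this is not code from either implementation): a classic swap costs three element assignments, while the moves inside the partition loops reuse an element x that has already been read and counted, so they cost only two.
def full_swap(a, i, j, counters):
    # classic swap: three element assignments
    counters['assignments'] += 3
    tmp = a[i]
    a[i] = a[j]
    a[j] = tmp

def place_saved_element(a, k, less, x, counters):
    # x already holds a[k] (read and counted earlier in the loop),
    # so moving it into the '< pivot1' area needs only two assignments
    counters['assignments'] += 2
    a[k] = a[less]
    a[less] = x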
I converted the code Vladimir Yaroslavskiy included in his document (Last updated: September 22, 2009) to Python and added the counters the same way as I did in my own implementation. The code is included at the end.
Any comments are welcome.
Here are the results, the averages of 10 runs.
The columns labeled VY are the results for the implementation by Vladimir; the columns labeled JB are those of my own implementation.
Length F Function call Assignments Comparisons Maximum Stack
of list per N per N.log2(N) per N.log2(N) per log2(N)
Random.shuffle(), unique number
Version VY JB VY JB VY JB VY JB
64 1 0.170 0.266 1.489 1.029 1.041 1.028 0.417 0.633
256 1 0.171 0.270 1.463 1.016 1.066 1.138 0.575 0.812
1024 1 0.167 0.275 1.451 1.046 1.089 1.165 0.690 1.010
4096 1 0.164 0.273 1.436 1.069 1.119 1.189 0.800 1.075
16384 1 0.166 0.273 1.444 1.077 1.117 1.270 0.843 1.221
65536 1 0.166 0.273 1.440 1.108 1.126 1.258 0.919 1.281
262144 1 0.166 0.273 1.423 1.102 1.134 1.278 0.950 1.306
1048576 1 0.166 0.273 1.426 1.085 1.131 1.273 0.990 1.290
Sorted, unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.203 0.203 1.036 0.349 0.643 0.586 0.333 0.333
256 1 0.156 0.156 0.904 0.262 0.643 0.609 0.375 0.375
1024 1 0.118 0.355 0.823 0.223 0.642 0.740 0.400 0.500
4096 1 0.131 0.267 0.840 0.181 0.679 0.732 0.500 0.500
16384 1 0.200 0.200 0.926 0.152 0.751 0.726 0.500 0.500
65536 1 0.150 0.150 0.866 0.131 0.737 0.722 0.500 0.500
262144 1 0.113 0.338 0.829 0.124 0.728 0.781 0.500 0.556
1048576 1 0.147 0.253 0.853 0.108 0.750 0.774 0.550 0.550
Reversed order, unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.203 0.203 1.320 0.836 0.841 0.802 0.333 0.333
256 1 0.156 0.156 1.118 0.703 0.795 0.783 0.375 0.375
1024 1 0.118 0.312 1.002 0.631 0.768 0.852 0.400 0.500
4096 1 0.125 0.267 0.977 0.569 0.776 0.861 0.500 0.500
16384 1 0.200 0.200 1.046 0.516 0.834 0.852 0.500 0.500
65536 1 0.150 0.150 0.974 0.475 0.813 0.844 0.500 0.500
262144 1 0.113 0.338 0.925 0.459 0.795 0.896 0.500 0.556
1048576 1 0.145 0.253 0.938 0.430 0.811 0.890 0.550 0.550
Random, with increasing frequency of the numbers.
The last row is a list of the same number
Version VY JB VY JB VY JB VY JB
65536 1 0.166 0.273 1.429 1.051 1.113 1.251 0.881 1.156
65536 2 0.167 0.270 1.404 1.075 1.112 1.238 0.894 1.194
65536 4 0.168 0.273 1.373 1.039 1.096 1.213 0.906 1.238
65536 8 0.151 0.245 1.302 1.029 1.069 1.199 0.900 1.262
65536 16 0.132 0.127 1.264 0.970 1.020 1.150 0.912 1.188
65536 32 0.090 0.064 1.127 0.920 0.950 1.099 0.856 1.119
65536 64 0.051 0.032 1.000 0.845 0.879 0.993 0.819 1.019
65536 128 0.026 0.016 0.884 0.792 0.797 0.923 0.725 0.931
65536 256 0.013 0.008 0.805 0.704 0.728 0.840 0.675 0.856
65536 512 0.006 0.004 0.690 0.615 0.652 0.728 0.588 0.669
65536 1024 0.003 0.002 0.635 0.557 0.579 0.654 0.519 0.625
65536 2048 0.002 0.001 0.541 0.487 0.509 0.582 0.438 0.463
65536 4096 0.001 0.000 0.459 0.417 0.434 0.471 0.369 0.394
65536 8192 0.000 0.000 0.351 0.359 0.357 0.405 0.294 0.300
65536 16384 0.000 0.000 0.247 0.297 0.253 0.314 0.206 0.194
65536 32768 0.000 0.000 0.231 0.188 0.209 0.212 0.125 0.081
65536 65536 0.000 0.000 0.063 0.125 0.063 0.125 0.062 0.000
Here is the code of Vladimir's sort in Python:
DIST_SIZE = 13
TINY_SIZE = 17

def dualPivotQuicksort(a, left, right, nesting=0):
    global assignements, comparisons, oproepen, maxnesting
    oproepen += 1
    maxnesting = max(maxnesting, nesting)
    length = right - left
    if length < TINY_SIZE: # insertion sort on tiny array
        # note by JB: rewritten to minimize the assignements
        for i in xrange(left+1, right+1):
            key = a[i]
            assignements += 1
            while i > left:
                comparisons += 1
                if key < a[i - 1]:
                    assignements += 1
                    a[i] = a[i-1]
                    i -= 1
                else:
                    break
            assignements += 1
            a[i] = key
        return
    # median indexes
    sixth = length / 6
    m1 = left + sixth
    m2 = m1 + sixth
    m3 = m2 + sixth
    m4 = m3 + sixth
    m5 = m4 + sixth
    assignements += 9*3
    comparisons += 9
    ## 5-element sorting network
    if a[m1] > a[m2]: a[m1],a[m2] = a[m2],a[m1]
    if a[m4] > a[m5]: a[m4],a[m5] = a[m5],a[m4]
    if a[m1] > a[m3]: a[m1],a[m3] = a[m3],a[m1]
    if a[m2] > a[m3]: a[m2],a[m3] = a[m3],a[m2]
    if a[m1] > a[m4]: a[m1],a[m4] = a[m4],a[m1]
    if a[m3] > a[m4]: a[m3],a[m4] = a[m4],a[m3]
    if a[m2] > a[m5]: a[m2],a[m5] = a[m5],a[m2]
    if a[m2] > a[m3]: a[m2],a[m3] = a[m3],a[m2]
    if a[m4] > a[m5]: a[m4],a[m5] = a[m5],a[m4]
    # pivots: [ < pivot1 | pivot1 <= && <= pivot2 | > pivot2 ]
    assignements += 2
    pivot1 = a[m2]
    pivot2 = a[m4]
    comparisons += 1
    diffPivots = pivot1 != pivot2
    assignements += 2
    a[m2] = a[left]
    a[m4] = a[right]
    # center part pointers
    less = left + 1
    great = right - 1
    # sorting
    if (diffPivots):
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 2
            if (x < pivot1):
                comparisons -= 1
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            elif (x > pivot2):
                while k < great:
                    comparisons += 1
                    if a[great] > pivot2:
                        great -= 1
                    else:
                        break
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if (x < pivot1):
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    else:
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 1
            if (x == pivot1):
                k += 1
                continue
            comparisons += 1
            if (x < pivot1):
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            else:
                while k < great:
                    comparisons += 1
                    if a[great] > pivot2:
                        great -= 1
                    else:
                        break
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if (x < pivot1):
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    # swap
    assignements += 2
    a[left] = a[less - 1]
    a[less - 1] = pivot1
    assignements += 2
    a[right] = a[great + 1]
    a[great + 1] = pivot2
    # left and right parts
    dualPivotQuicksort(a, left, less - 2, nesting+1)
    dualPivotQuicksort(a, great + 2, right, nesting+1)
    # equal elements
    if (great - less > length - DIST_SIZE and diffPivots):
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 2
            if (x == pivot1):
                comparisons -= 1
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            elif (x == pivot2):
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if (x == pivot1):
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    # center part
    if (diffPivots):
        dualPivotQuicksort(a, less, great, nesting+1)
This code is about 190 lines; my current implementation, written with the same formatting, is about 110 lines.
So any remarks are welcome.

.gvs (GuideView openmp statistics) file format

Is there a documented format for the *.gvs files used by the GuideView OpenMP performance analyser?
The guide.gvs file is generated, for example, by Intel-compiled OpenMP programs with:
$ export LD_PRELOAD=<path_to_icc_or_redist>/lib/libiompprof5.so
$ ./openmp_parallelized_prog
$ ls -l guide.gvs
It is plain text. Here is an example from a very short OpenMP program:
$ cat guide.gvs
*** KAI statistics library k3301
*** Begin Task 0
Environment variables:
OMP_NUM_THREADS : 2
OMP_SCHEDULE : static
OMP_DYNAMIC : FALSE
OMP_NESTED : FALSE
KMP_STATSFILE : guide.gvs
KMP_STATSCOLS : 80
KMP_INTERVAL : 0
KMP_BLOCKTIME : 200
KMP_PARALLEL : 2
KMP_STACKSIZE : 2097152
KMP_STACKOFFSET : 0
KMP_SCHEDULING : <unknown>
KMP_CHUNK : <unknown>
KMP_LIBRARY : throughput
end
System parameters:
start : Wed Nov 1 12:26:52 2010
stop : Wed Nov 1 12:26:52 2010
host : localhost
ncpu : 2
end
Unix process parameters:
maxrss : 0
minflt : 440
majflt : 2
nswap : 0
inblock : 208
oublock : 0
nvcsw : 6
nivcsw : 7
end
Region counts:
serial regions : 2
barrier regions : 0
parallel regions : 1
end
Program execution time (in seconds):
cpu : 0.00 sec
elapsed : 0.04 sec
serial : 0.00 sec
parallel : 0.04 sec
cpu percent : 0.01 %
end
Summary over all regions (has 2 threads):
# Thread #0 #1
Sum Parallel : 0.036 0.027
Sum Imbalance : 0.035 0.026
Min Parallel : 0.036 0.027
Min Imbalance : 0.035 0.026
Max Parallel : 0.036 0.027
Max Imbalance : 0.035 0.026
end
Region #1 (has 2 threads) at main/9 in "/home/user/icc/omp.c":
# Thread #0 #1
Sum Parallel : 0.036 0.027
Sum Imbalance : 0.035 0.026
Min Parallel : 0.036 0.027
Min Imbalance : 0.035 0.026
Max Parallel : 0.036 0.027
Max Imbalance : 0.035 0.026
end
Region #1 (has 2 threads) profile:
# Thread Incl Excl Routine
0,0 : 0.000 0.000 main/9 "/home/user/icc/omp.c"
1,0 : 0.000 0.000 main/9 "/home/user/icc/omp.c"
end
Serial program regions:
Serial region #1 executes for 0.00 seconds
begins at START OF PROGRAM
ends before region #1 (using 2 threads) at main/9 in "/home/user/icc/omp.c"
Serial region #2 executes for 0.00 seconds
begins after region #1 (using 2 threads) at main/9 in "/home/user/icc/omp.c"
ends at END OF PROGRAM
end
Serial region #1 profile:
# Thread Incl Excl Routine
end
Serial region #2 profile:
# Thread Incl Excl Routine
end
Program events (total):
# Thread #0 #1
mppbeg : 1 0
mppend : 1 0
serial : 2 0
mppfkd : 1 0
mppfrk : 1 0
mppjoi : 1 0
mppadj : 1 0
mpptid : 51 50
end
Region #1 (has 2 threads) events:
# Thread #0 #1
mppfrk : 1 0
mppjoi : 1 0
mpptid : 50 50
end
Serial section events:
# Serial #1 #2
mppbeg : 1 0
mppend : 0 1
serial : 1 1
mppfkd : 1 0
mppadj : 1 0
mpptid : 1 0
end
*** end
