Why is the Impala scan node very slow (RowBatchQueueGetWaitTime)?

This query returns in 10 seconds most of the time, but occasionally it needs 40 seconds or more.
There are two executor nodes in the cluster, and there is no notable difference between the profiles of the two nodes; here is one of them:
HDFS_SCAN_NODE (id=0):(Total: 39s818ms, non-child: 39s818ms, % non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.07
- AverageScannerThreadConcurrency: 1.47
- BytesRead: 563.73 MB (591111366)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0
- CachedFileHandlesHitCount: 0 (0)
- CachedFileHandlesMissCount: 560 (560)
- CollectionItemsRead: 0 (0)
- DecompressionTime: 1s501ms
- MaterializeTupleTime(*): 11s685ms
- MaxCompressedTextFileLength: 0
- NumColumns: 9 (9)
- NumDictFilteredRowGroups: 0 (0)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 56 (56)
- NumScannerThreadMemUnavailable: 0 (0)
- NumScannerThreadReservationsDenied: 0 (0)
- NumScannerThreadsStarted: 4 (4)
- NumScannersWithNoReads: 0 (0)
- NumStatsFilteredRowGroups: 0 (0)
- PeakMemoryUsage: 142.10 MB (149004861)
- PeakScannerThreadConcurrency: 2 (2)
- PerReadThreadRawHdfsThroughput: 151.39 MB/sec
- RemoteScanRanges: 1.68K (1680)
- RowBatchBytesEnqueued: 2.32 GB (2491334455)
- RowBatchQueueGetWaitTime: 39s786ms
- RowBatchQueuePeakMemoryUsage: 1.87 MB (1959936)
- RowBatchQueuePutWaitTime: 0.000ns
- RowBatchesEnqueued: 6.38K (6377)
- RowsRead: 73.99M (73994828)
- RowsReturned: 6.40M (6401849)
- RowsReturnedRate: 161.27 K/sec
- ScanRangesComplete: 56 (56)
- ScannerThreadsInvoluntaryContextSwitches: 99 (99)
- ScannerThreadsTotalWallClockTime: 1m10s
- ScannerThreadsSysTime: 630.808ms
- ScannerThreadsUserTime: 12s824ms
- ScannerThreadsVoluntaryContextSwitches: 1.25K (1248)
- TotalRawHdfsOpenFileTime(*): 9s396ms
- TotalRawHdfsReadTime(*): 3s789ms
- TotalReadThroughput: 11.70 MB/sec
Buffer pool:
- AllocTime: 1.240ms
- CumulativeAllocationBytes: 706.32 MB (740630528)
- CumulativeAllocations: 578 (578)
- PeakReservation: 140.00 MB (146800640)
- PeakUnpinnedBytes: 0
- PeakUsedReservation: 33.83 MB (35471360)
- ReadIoBytes: 0
- ReadIoOps: 0 (0)
- ReadIoWaitTime: 0.000ns
- WriteIoBytes: 0
- WriteIoOps: 0 (0)
- WriteIoWaitTime: 0.000ns
We can see that RowBatchQueueGetWaitTime is very high, almost 40 seconds, but I cannot figure out why. Granting that TotalRawHdfsOpenFileTime takes 9 seconds and TotalRawHdfsReadTime takes almost 4 seconds, I still cannot explain where the other 27 seconds are spent.
Can you suggest the possible issue and how can I solve it?

The threading model in the scan nodes is fairly complex because there are two layers of worker threads for scanning and I/O - I'll call them scanner threads and I/O threads. I'll go top down and call out some potential bottlenecks and how to identify them.
High RowBatchQueueGetWaitTime indicates that the main thread consuming from the scan is spending a lot of time waiting for the scanner threads to produce rows. One major source of variance can be the number of scanner threads - if the system is under resource pressure each query can get fewer threads. So keep an eye on AverageScannerThreadConcurrency to understand if that is varying.
The scanner threads spend their time doing a variety of things. The bulk of the time generally goes to:
1. Not running, because the operating system scheduled a different thread.
2. Waiting for I/O threads to read data from the storage system.
3. Decoding data, evaluating predicates, and other work.
With #1 you would see a higher value for ScannerThreadsInvoluntaryContextSwitches, and ScannerThreadsUserTime + ScannerThreadsSysTime much lower than ScannerThreadsTotalWallClockTime. ScannerThreadsUserTime being much lower than MaterializeTupleTime would be another symptom.
With #3 you would see high ScannerThreadsUserTime and MaterializeTupleTime. It looks like here there is a significant amount of CPU time going to that, but it is not the bulk of the time.
To identify #2, I would recommend looking at TotalStorageWaitTime in the fragment profile to understand how much time threads actually spent waiting for I/O. I also added ScannerIoWaitTime in more recent Impala releases, which is more convenient since it's in the scanner profile.
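As a rough sanity check on where the time went, the CPU-time counters can be compared against the scanner threads' wall-clock time. A sketch using the values from the profile in the question (the split is a heuristic, not an exact accounting):

```python
def diagnose_scan(counters):
    """Rough split of scanner-thread wall-clock time (values in seconds).

    Heuristic: user+sys CPU time approximates the decode/predicate work;
    the remainder was spent off-CPU (scheduled out, or waiting for I/O).
    """
    wall = counters["ScannerThreadsTotalWallClockTime"]
    cpu = counters["ScannerThreadsUserTime"] + counters["ScannerThreadsSysTime"]
    return {"cpu_work": cpu,
            "off_cpu_wait": wall - cpu,
            "off_cpu_fraction": (wall - cpu) / wall}

# Counter values from the profile in the question, converted to seconds.
report = diagnose_scan({
    "ScannerThreadsTotalWallClockTime": 70.0,
    "ScannerThreadsUserTime": 12.824,
    "ScannerThreadsSysTime": 0.631,
})
print(report)  # roughly 80% of scanner wall-clock time was spent off-CPU
```

For this profile, about 56 of the 70 wall-clock seconds are unaccounted for by CPU work, which points at the off-CPU buckets.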
If the storage wait time is high, there are a few things to consider:
If TotalRawHdfsOpenFileTime is high, it could be that opening the files is a bottleneck. This can happen on any storage system, including HDFS. See Why Impala spend a lot of time Opening HDFS File (TotalRawHdfsOpenFileTime)?
If TotalRawHdfsReadTime is high, reading from the storage system may be slow (e.g. if the data is not in the OS buffer cache or it is a remote filesystem like S3)
Other queries may be contending for I/O resources and/or I/O threads
I suspect that in your case the root cause is both slowness opening files for this query and slowness opening files for other queries causing scanner threads to be occupied. Enabling file handle caching will likely solve the problem - we've seen dramatic performance improvements on production deployments from doing that.
Another possibility worth mentioning is that the embedded JVM is doing some garbage collection - this can block some of the HDFS operations. There is some pause detection that logs messages when there is a JVM pause. You can also look at the /memz debug page, which I think has some GC stats, or connect other Java debugging tools.

ScannerThreadsVoluntaryContextSwitches: 1.25K (1248) means that there were 1248 situations where scan threads got "stuck" waiting for some external resource and were subsequently put to sleep.
Most likely that resource was disk I/O. That would explain the quite low average read speed (TotalReadThroughput: 11.70 MB/sec) despite the "normal" per-read throughput (PerReadThreadRawHdfsThroughput: 151.39 MB/sec).
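The relationship between those two throughput counters can be checked directly: dividing the aggregate throughput by the per-read-thread throughput gives the effective read concurrency averaged over the whole scan, which should line up with the profile's AverageHdfsReadThreadConcurrency. A quick sketch with the numbers from this profile:

```python
# Counters taken from the profile above.
total_read_throughput = 11.70    # MB/s  (TotalReadThroughput)
per_thread_throughput = 151.39   # MB/s  (PerReadThreadRawHdfsThroughput)

# Effective number of concurrently reading I/O threads, averaged over the
# whole scan: individual reads are fast, but reading happens rarely.
effective_concurrency = total_read_throughput / per_thread_throughput
print(round(effective_concurrency, 2))  # ~0.08, in line with AverageHdfsReadThreadConcurrency = 0.07
```

In other words, the disks were nearly idle on average, so the scan was not bandwidth-bound; the time went to waiting, not reading.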
EDIT
To increase performance, you may want to try:
- enable short-circuit reads (dfs.client.read.shortcircuit=true)
- configure HDFS caching and alter the Impala table to use the cache
(Note that both are applicable only if you're running Impala against HDFS, not some sort of object store.)

Related

How to get better performance in a ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now on 6.4-15) and we noticed a decline in performance whenever there is heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). The OSDs are hard drives (HDD), WD Gold or better (4-12 TB). Nodes have 64/128 GB RAM and dual-Xeon mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/sec transfer to each of them, with a +-10 MB/sec spread during normal operations. Apply/Commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front and back networks are both 1 Gbps (separated in VLANs); we are trying to move to 10 Gbps, but we ran into some trouble we are still trying to solve (unstable OSD disconnections).
The pool is defined as "replicated" with 3 copies (2 needed to keep running). The total amount of disk space is 305 TB (72% used); reweight is in use because some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Here are some Ceph pointers you could follow...
- Get some good NVMe drives (one or two per server; with 8 HDDs per server, one should be enough) and use them as DB/WAL devices (make sure they have power-loss protection).
- "ceph tell osd.* bench" is not that relevant for the real world; I suggest trying some fio tests instead.
- Set osd_memory_target to at least 8 GB of RAM per OSD.
- To save some writes on your HDDs (so data is not replicated X times), create your RBD pool as EC (erasure coded), but please do some research on that first because there are tradeoffs - recovery takes some extra CPU.
All in all, hyper-converged clusters are good for training, small projects, and medium projects without such a big workload on them... Keep in mind that planning is gold.
Just my 2 cents,
B.

Redis - Benchmark vs reality

I have a standalone Redis instance in production. Earlier, 8 instances of my application, each holding 64 Redis connections (total 8*64) at a rate of 2000 QPS per instance, would give me a latency of < 10 ms (which I am fine with). Due to an increase in traffic, I had to increase the number of application instances to 16, while also decreasing the connection count per instance from 64 to 16 (total 16*16 = 256). This was done after benchmarking with memtier_benchmark as below:
12 Threads
64 Connections per thread
2000 Requests per thread
ALL STATS
========================================================================
Type Ops/sec Hits/sec Misses/sec Latency KB/sec
------------------------------------------------------------------------
Sets 0.00 --- --- 0.00000 0.00
Gets 79424.54 516.26 78908.28 9.90400 2725.45
Waits 0.00 --- --- 0.00000 ---
Totals 79424.54 516.26 78908.28 9.90400 2725.45
16 Threads
16 Connections per thread
2000 Requests per thread
ALL STATS
========================================================================
Type Ops/sec Hits/sec Misses/sec Latency KB/sec
------------------------------------------------------------------------
Sets 0.00 --- --- 0.00000 0.00
Gets 66631.87 433.11 66198.76 3.32800 2286.47
Waits 0.00 --- --- 0.00000 ---
Totals 66631.87 433.11 66198.76 3.32800 2286.47
Redis benchmark gave similar results.
However, when I made this change (16*16) in production, the latency shot back up to 60-70 ms. I thought the provisioned connection count was too low (which seemed unlikely) and went back to 64 connections per instance (16*64), which, as expected, increased the latency further. For now, I have half of my application instances hitting the Redis master and the other half connected to the slave, each with 64 connections (8*64 to master, 8*64 to slave), and this works for me (8-10 ms latency).
What could have gone wrong such that latency increased with 256 (16*16) connections but dropped with 512 (8*64) connections, even though the benchmark says otherwise? I agree the benchmark shouldn't be fully trusted, but even as a guideline these are polar-opposite results.
Note: 1. The application and Redis are colocated, so there is no network latency. Memory usage in Redis is about 40% and the fragmentation ratio is about 1.4. The application uses Jedis for connection pooling. 2. The latency does not include the overhead of a Redis miss; only the Redis round trip is considered.
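One way to reconcile the benchmark and production numbers is Little's law (in-flight requests = throughput x latency). A sketch applying it to the two memtier runs quoted above suggests the benchmark latency is mostly closed-loop queueing per connection, which does not transfer directly to an open-loop production workload:

```python
# Little's law: in_flight = throughput * latency.
# Applied to the two memtier_benchmark runs quoted in the question.
runs = [
    {"ops_per_sec": 79424.54, "latency_s": 0.0099040, "connections": 12 * 64},
    {"ops_per_sec": 66631.87, "latency_s": 0.0033280, "connections": 16 * 16},
]
in_flight = [r["ops_per_sec"] * r["latency_s"] for r in runs]

# In a closed-loop benchmark each connection keeps roughly one request in
# flight, so in_flight lands near the connection count: the reported latency
# mostly measures per-connection queueing, not how fast Redis responds.
print([round(x) for x in in_flight], [r["connections"] for r in runs])
```

Since both runs are saturated, the benchmark's "lower latency with fewer connections" mostly reflects shorter per-connection queues, not a configuration that would behave better under production's arrival pattern.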

Get current disk load

Since I can't use watch on iostat -dx 1 to get the current disk load, I'd like to know if there is an alternative way to do this, e.g., doing calculations with the values contained in /proc/diskstats and/or some other files.
According to kernel.org, the mapping is :
The /proc/diskstats file displays the I/O statistics
of block devices. Each line contains the following 14
fields:
1 - major number
2 - minor number
3 - device name
4 - reads completed successfully
5 - reads merged
6 - sectors read
7 - time spent reading (ms)
8 - writes completed
9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)
For more details refer to Documentation/iostats.txt
You can also use or read the Perl module Sys::Statistics::Linux::DiskStats.
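A minimal sketch of the calculation in Python, using the field numbering from the table above: field 13 (milliseconds spent doing I/O) sampled over an interval approximates iostat's %util, i.e. the fraction of time the device had at least one I/O in flight.

```python
import time

def parse_io_ms(diskstats_text, device):
    """Extract field 13 (ms spent doing I/O) for a device from
    /proc/diskstats-format text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])  # 13th field: time spent doing I/Os (ms)
    raise ValueError(f"device {device!r} not found")

def disk_util(device, interval=1.0):
    """Approximate iostat's %util: percentage of the interval the device
    spent with at least one I/O in progress."""
    def sample():
        with open("/proc/diskstats") as f:
            return parse_io_ms(f.read(), device)
    start = sample()
    time.sleep(interval)
    return 100.0 * (sample() - start) / (interval * 1000.0)
```

Calling disk_util("sda") on a Linux box should track iostat's %util column for that device. Note this is only a coarse busy percentage; per-request latency still needs the read/write time fields.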

Application not running at full speed?

I have the following scenario:
machine 1: receives messages from outside and processes them (via a Java application); for processing it relies on a database (on machine 2)
machine 2: an Oracle DB
As performance metrics I usually look at the value of processed messages per time.
Now, what puzzles me: neither of the two machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as if they don't have enough to do.
What I would expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message-processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So to me, none of these values seems to be at any limit.
PS: for testing of course the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically need to measure inside the application as well. That means profiling the Java application code, and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.

Using time command for benchmarking

I'm trying to use the time command as a simple solution for benchmarking some scripts that do a lot of text processing and make a number of network calls. To evaluate if it's a good fit, I tried:
/usr/bin/time -f "\n%E elapsed,\n%U user,\n%S system,\n%P CPU,\n%M max-mem footprint in KB,\n%t avg-mem footprint in KB,\n%K Average total (data+stack+text) memory,\n%F major page faults,\n%I file system inputs by the process,\n%O file system outputs by the process,\n%r socket messages received,\n%s socket messages sent,\n%x status" yum install nmap
and got:
1:35.15 elapsed,
3.17 user,
0.40 system,
3% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
127 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
0 socket messages received,
0 socket messages sent,
0 status
which is not exactly what I was expecting - especially the 0 values. Even when I change the command to, say, ping google.com, the socket messages are 0. What's going on? Is there an alternative?
[And I'm confused if it should stay here or be posted in serverfault]
I think it's not working on Linux. The manual page says:
Bugs
Not all resources are measured by all versions of Unix,
so some of the values might be reported as zero. The present
selection was mostly inspired by the data provided by 4.2 or
4.3BSD.
I tried "wget" on an OSX system (which is BSD-ish) to check if it report socket statistics, and there at least socket works:
0.00 user,
0.01 system,
1% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
0 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
151 socket messages received,
8 socket messages sent,
0 status
Hope that helps,
Alex.
Do not use time to benchmark. Some of the fields of the time command are broken, as noted in [1]. However, the basic functionality of time (real, user, and sys time) is still intact.
[1] Maximum resident set size does not make sense
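Since the CPU-time and memory fields are the ones that generally do work on Linux, another option is to skip the time wrapper and read getrusage(2) yourself - the same kernel data /usr/bin/time is based on. A minimal sketch using Python's resource module (the sleep child is just a placeholder command):

```python
import resource
import subprocess

# Run a child command, then read its accumulated resource usage directly
# from the kernel via getrusage(2).
subprocess.run(["sleep", "0.1"], check=True)
usage = resource.getrusage(resource.RUSAGE_CHILDREN)

print(f"user   {usage.ru_utime:.3f}s")
print(f"system {usage.ru_stime:.3f}s")
print(f"maxrss {usage.ru_maxrss} kB")  # kilobytes on Linux
```

RUSAGE_CHILDREN aggregates over all waited-for children, so for a single command this matches what time would report for user/sys time and peak RSS (without GNU time's historical maxrss multiplication bug).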