How to tune the SparkPageRank example application for shorter garbage collection?

I am running a Spark application on a 7-node cluster on Amazon EC2: 1 driver and 6 executors. I use 6 m4.2xlarge instances with 1 executor each; they have 8 cores apiece. The driver runs on an m4.xlarge VM, which has 4 cores. The Spark version is 2.1.1.
I use the following command to start SparkPageRank application.
spark-submit \
--name "ABC" \
--master spark://xxx:7077 \
--conf spark.driver.memory=10g \
--conf "spark.app.name=ABC" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:ConcGCThreads=5" \
--class org.apache.spark.examples.SparkPageRank \
--executor-memory 22g \
/home/ubuntu/spark-2.1.1/examples/target/scala-2.11/jars/spark-examples_2.11-2.1.1.jar /hdfscheck/pagerank_data_11G_repl1.txt 4
The GC time using this configuration comes out really high.
Here is a chunk of the GC log for one of the executors:
1810.053: [GC pause (GCLocker Initiated GC) (young), 0.1694102 secs]
[Parallel Time: 167.8 ms, GC Workers: 8]
[GC Worker Start (ms): Min: 1810053.2, Avg: 1810053.3, Max: 1810053.4, Diff: 0.1]
[Ext Root Scanning (ms): Min: 0.2, Avg: 0.4, Max: 0.7, Diff: 0.5, Sum: 2.9]
[Update RS (ms): Min: 12.4, Avg: 12.7, Max: 13.2, Diff: 0.7, Sum: 101.4]
[Processed Buffers: Min: 11, Avg: 12.9, Max: 16, Diff: 5, Sum: 103]
[Scan RS (ms): Min: 29.4, Avg: 29.8, Max: 30.1, Diff: 0.7, Sum: 238.7]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Object Copy (ms): Min: 124.5, Avg: 124.6, Max: 124.7, Diff: 0.1, Sum: 996.9]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Termination Attempts: Min: 1, Avg: 2.2, Max: 5, Diff: 4, Sum: 18]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[GC Worker Total (ms): Min: 167.5, Avg: 167.5, Max: 167.6, Diff: 0.1, Sum: 1340.2]
[GC Worker End (ms): Min: 1810220.8, Avg: 1810220.8, Max: 1810220.8, Diff: 0.0]
[Code Root Fixup: 0.0 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.4 ms]
[Other: 1.2 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.5 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.4 ms]
[Humongous Register: 0.0 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.1 ms]
[Eden: 992.0M(960.0M)->0.0B(960.0M) Survivors: 160.0M->160.0M Heap: 14.6G(22.0G)->13.8G(22.0G)]
[Times: user=1.34 sys=0.00, real=0.17 secs]
(More at https://pastebin.com/E5bbQZgD)
The only fishy thing I could see is that concurrent-mark-end took a lot of time.
I would appreciate it if someone could tell me how to tune garbage collection for this particular case. The VM the driver node runs on has 16GB of memory, whereas the executor VMs have 32GB.

(Not really an answer but just a collection of hints to help out).
My driver node has 16GB of memory
I don't think that's the case, given that you executed spark-submit with spark.driver.memory=10g.
You should use --driver-memory instead (which is just a shortcut, but makes things a little easier to remember):
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
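For example, the same submission with the dedicated flag (all other options exactly as in your command, elided here):
spark-submit \
--name "ABC" \
--master spark://xxx:7077 \
--driver-memory 10g \
--executor-memory 22g \
...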
Regarding your main question, the heavy GC use seems to be simply how the PageRank algorithm behaves; note the high Shuffle Write while the Input is not that large.
I also think that the GC time is not that long compared to the Task Time.
I'm concerned that RDD Blocks is only 2, as that seems to suggest very low parallelism, but that might be how it should work.
run-example SparkPageRank
The whole invocation below can be replaced with a simple run-example SparkPageRank (as described in Spark's Where to Go from Here):
spark-submit ... \
--class org.apache.spark.examples.SparkPageRank ... \
/home/ubuntu/spark-2.1.1/examples/target/scala-2.11/jars/spark-examples_2.11-2.1.1.jar
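That is, something like the following, assuming you run it from the Spark home directory (the script locates the bundled examples jar for you; arguments as in your original command):
./bin/run-example SparkPageRank /hdfscheck/pagerank_data_11G_repl1.txt 4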

Related

Find index of maximum element satisfying condition (Julia)

In Julia I can use argmax(X) to find the max element. If I want to find all elements satisfying condition C I can use findall(C, X). But how can I combine the two? What's the most efficient/idiomatic/concise way to find the index of the maximum element satisfying some condition in Julia?
If you'd like to avoid allocations, filtering the array lazily would work:
idx_filtered = (i for (i, el) in pairs(X) if C(el))
argmax(i -> X[i], idx_filtered)
Unfortunately, this is about twice as slow as a hand-written version. (edit: in my benchmarks, it's 2x slower on Intel Xeon Platinum but nearly equal on Apple M1)
function byhand(C, X)
    start = findfirst(C, X)
    isnothing(start) && return nothing
    imax, max = start, X[start]
    for i = start:lastindex(X)
        if C(X[i]) && X[i] > max
            imax, max = i, X[i]
        end
    end
    imax, max
end
You can store the indices returned by findall and subset them with the result of argmax over the corresponding elements fulfilling the condition.
X = [5, 4, -3, -5]
C = <(0)
i = findall(C, X);
i[argmax(X[i])]
#3
Or combine both:
argmax(i -> X[i], findall(C, X))
#3
This assumes the result of findall is not empty; otherwise it needs to be guarded, e.g. with isempty.
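A guarded variant might look like this (just a sketch; safe_argmax is my own name, not a library function):
function safe_argmax(C, X)
    i = findall(C, X)
    isempty(i) && return nothing   # no element satisfies C
    i[argmax(view(X, i))]          # view avoids copying X[i]
end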
Benchmark
#Functions
function August(C, X)
    idx_filtered = (i for (i, el) in pairs(X) if C(el))
    argmax(i -> X[i], idx_filtered)
end
function byhand(C, X)
    start = findfirst(C, X)
    isnothing(start) && return nothing
    imax, max = start, X[start]
    for i = start:lastindex(X)
        if C(X[i]) && X[i] > max
            imax, max = i, X[i]
        end
    end
    imax, max
end
function GKi1(C, X)
    i = findall(C, X);
    i[argmax(X[i])]
end
GKi2(C, X) = argmax(i -> X[i], findall(C, X))
#Data
using Random
Random.seed!(42)
n = 100000
X = randn(n)
C = <(0)
#Benchmark
using BenchmarkTools
suite = BenchmarkGroup()
suite["August"] = @benchmarkable August(C, $X)
suite["byhand"] = @benchmarkable byhand(C, $X)
suite["GKi1"] = @benchmarkable GKi1(C, $X)
suite["GKi2"] = @benchmarkable GKi2(C, $X)
tune!(suite);
results = run(suite)
#Results
results
#4-element BenchmarkTools.BenchmarkGroup:
# tags: []
# "August" => Trial(641.061 μs)
# "byhand" => Trial(261.135 μs)
# "GKi2" => Trial(259.260 μs)
# "GKi1" => Trial(339.570 μs)
results.data["August"]
#BenchmarkTools.Trial: 7622 samples with 1 evaluation.
# Range (min … max): 641.061 μs … 861.379 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 643.640 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 653.027 μs ± 18.123 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
#
# ▄█▅▄▃ ▂▂▃▁ ▁▃▃▂▂ ▁▃ ▁▁ ▁
# ██████▇████████████▇▆▆▇████▇▆██▇▇▇▆▆▆▅▇▆▅▅▅▅▆██▅▆▆▆▇▆▇▇▆▇▆▆▆▅ █
# 641 μs Histogram: log(frequency) by time 718 μs <
#
# Memory estimate: 16 bytes, allocs estimate: 1.
results.data["byhand"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 261.135 μs … 621.141 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 261.356 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 264.382 μs ± 11.638 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
#
# █ ▁▁▁▁ ▂ ▁▁ ▂ ▁ ▁ ▁
# █▅▂▂▅████▅▄▃▄▆█▇▇▆▄▅███▇▄▄▅▆▆█▄▇█▅▄▅▅▆▇▇▅▄▅▄▄▄▃▄▃▃▃▄▅▆▅▄▇█▆▅▄ █
# 261 μs Histogram: log(frequency) by time 292 μs <
#
# Memory estimate: 32 bytes, allocs estimate: 1.
results.data["GKi1"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 339.570 μs … 1.447 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 342.579 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 355.167 μs ± 52.935 μs ┊ GC (mean ± σ): 1.90% ± 6.85%
#
# █▆▄▅▃▂▁▁ ▁ ▁
# ████████▇▆▆▅▅▅▆▄▄▄▄▁▃▁▁▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
# 340 μs Histogram: log(frequency) by time 722 μs <
#
# Memory estimate: 800.39 KiB, allocs estimate: 11.
results.data["GKi2"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 259.260 μs … 752.773 μs ┊ GC (min … max): 0.00% … 54.40%
# Time (median): 260.692 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 270.300 μs ± 40.094 μs ┊ GC (mean ± σ): 1.31% ± 5.60%
#
# █▁▁▅▄▂▂▄▃▂▁▁▁ ▁ ▁
# █████████████████▇██▆▆▇▆▅▄▆▆▆▄▅▄▆▅▇▇▆▆▅▅▄▅▃▃▅▃▄▁▁▁▃▁▃▃▃▄▃▃▁▃▃ █
# 259 μs Histogram: log(frequency) by time 390 μs <
#
# Memory estimate: 408.53 KiB, allocs estimate: 9.
versioninfo()
#Julia Version 1.8.0
#Commit 5544a0fab7 (2022-08-17 13:38 UTC)
#Platform Info:
# OS: Linux (x86_64-linux-gnu)
# CPU: 8 × Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
# WORD_SIZE: 64
# LIBM: libopenlibm
# LLVM: libLLVM-13.0.1 (ORCJIT, sandybridge)
# Threads: 1 on 8 virtual cores
In this example argmax(i -> X[i], findall(C, X)) is close in performance to the hand-written function from @August's answer but uses more memory; it can, however, show better performance when the data is sorted:
sort!(X)
results = run(suite)
#4-element BenchmarkTools.BenchmarkGroup:
# tags: []
# "August" => Trial(297.519 μs)
# "byhand" => Trial(270.486 μs)
# "GKi2" => Trial(242.320 μs)
# "GKi1" => Trial(319.732 μs)
From what I understand of your question, you can use findmax() (requires Julia >= v1.7) to find the maximum index in the result of findall():
julia> v = [10, 20, 30, 40, 50]
5-element Vector{Int64}:
10
20
30
40
50
julia> findmax(findall(x -> x > 30, v))[1]
5
Performance of the above function:
julia> v = collect(10:1:10_000_000);
julia> @btime findmax(findall(x -> x > 30, v))[1]
33.471 ms (10 allocations: 77.49 MiB)
9999991
Update: the solutions suggested by @dan-getz, using last() and findlast(), both perform better than findmax(), and findlast() is the clear winner:
julia> @btime last(findall(x -> x > 30, v))
19.961 ms (9 allocations: 77.49 MiB)
9999991
julia> @btime findlast(x -> x > 30, v)
81.422 ns (2 allocations: 32 bytes)
Update 2: it looks like the OP wanted the max element, not only its index. In that case, the solution would be:
julia> v[findmax(findall(x -> x > 30, v))[1]]
50

What does "l" mean in Elasticsearch node status?

Below is the node status of my Elasticsearch cluster (please note the node.role column):
[root@manager]# curl -XGET http://192.168.6.51:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.6.54 20 97 0 0.00 0.00 0.00 dim - siem03.arif.local
192.168.6.51 34 55 0 0.16 0.06 0.01 l - siem00.arif.local
192.168.6.52 15 97 0 0.00 0.00 0.00 dim * siem01.arif.local
192.168.6.53 14 97 0 0.00 0.00 0.00 dim - siem02.arif.local
From Elasticsearch Documentation,
node.role, r, role, nodeRole
(Default) Roles of the node. Returned values include m (master-eligible node), d (data node), i (ingest node), and - (coordinating node only).
So, from the above output, dim means Data + Master + Ingest node, which is absolutely correct. But I configured the host siem00.arif.local as a coordinating node, and it shows l, which is not an option described by the documentation.
So what does it mean? It was just - before, but after an update (which I pushed to each of the nodes) it doesn't work anymore and shows l in node.role.
UPDATE:
All the other nodes except the coordinating node were 1 version back. Now I have updated all of the nodes with exact same version. Now it works and here is the output,
[root@manager]# curl -XGET http://192.168.6.51:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.6.53 9 79 2 0.00 0.20 0.19 dilm * siem02.arif.local
192.168.6.52 13 78 2 0.18 0.24 0.20 dilm - siem01.arif.local
192.168.6.51 33 49 1 0.02 0.21 0.20 l - siem00.arif.local
192.168.6.54 12 77 4 0.02 0.19 0.17 dilm - siem03.arif.local
The current version is:
[root@manager]# rpm -qa | grep elasticsearch
elasticsearch-7.4.0-1.x86_64
The built-in roles are indeed d, m, i and -, but any plugin is free to define new roles if needed. There's another one called v for voting-only nodes.
The l role is for machine learning nodes (i.e. those with node.ml: true), as can be seen in the source code of MachineLearning.java in the MachineLearning plugin.
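So if that node should report as a plain coordinating-only node (-) again, the usual approach on 7.x is to disable every role in its elasticsearch.yml, ML included. A sketch, assuming the default distribution with X-Pack machine learning enabled:
node.master: false
node.data: false
node.ingest: false
node.ml: false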

Huge performance difference with Iperf between - with and without VPN Tunnel

I am running some performance measurements across different network setups using iperf, and I see drastic differences between two basic scenarios:
1. Two containers (Docker) connected to each other via the default docker0 bridge interface on the host.
2. Two containers connected via a VPN tunnel interface that is internally connected via the above docker0 bridge.
Here are the iperf measurements for both scenarios (10-second runs):
**Scenario One (1)**
Client connecting to 172.17.0.4, TCP port 5001
TCP window size: 1.12 MByte (default)
------------------------------------------------------------
[ 3] local 172.17.0.2 port 50728 connected with 172.17.0.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 3.26 GBytes 28.0 Gbits/sec
[ 3] 1.0- 2.0 sec 3.67 GBytes 31.5 Gbits/sec
[ 3] 2.0- 3.0 sec 3.70 GBytes 31.8 Gbits/sec
[ 3] 3.0- 4.0 sec 3.93 GBytes 33.7 Gbits/sec
[ 3] 4.0- 5.0 sec 3.34 GBytes 28.7 Gbits/sec
[ 3] 5.0- 6.0 sec 3.44 GBytes 29.6 Gbits/sec
[ 3] 6.0- 7.0 sec 3.55 GBytes 30.5 Gbits/sec
[ 3] 7.0- 8.0 sec 3.50 GBytes 30.0 Gbits/sec
[ 3] 8.0- 9.0 sec 3.41 GBytes 29.3 Gbits/sec
[ 3] 9.0-10.0 sec 3.20 GBytes 27.5 Gbits/sec
[ 3] 0.0-10.0 sec 35.0 GBytes 30.1 Gbits/sec
**Scenario Two (2)**
Client connecting to 10.23.0.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.12.0.2 port 41886 connected with 10.23.0.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 15.1 MBytes 127 Mbits/sec
[ 3] 1.0- 2.0 sec 14.9 MBytes 125 Mbits/sec
[ 3] 2.0- 3.0 sec 14.9 MBytes 125 Mbits/sec
[ 3] 3.0- 4.0 sec 14.2 MBytes 120 Mbits/sec
[ 3] 4.0- 5.0 sec 16.4 MBytes 137 Mbits/sec
[ 3] 5.0- 6.0 sec 18.0 MBytes 151 Mbits/sec
[ 3] 6.0- 7.0 sec 18.6 MBytes 156 Mbits/sec
[ 3] 7.0- 8.0 sec 16.4 MBytes 137 Mbits/sec
[ 3] 8.0- 9.0 sec 13.5 MBytes 113 Mbits/sec
[ 3] 9.0-10.0 sec 15.0 MBytes 126 Mbits/sec
[ 3] 0.0-10.0 sec 157 MBytes 132 Mbits/sec
I am confused by the large difference in throughput.
Is the degradation due to the encryption and decryption that OpenSSL performs?
Or is it because packet headers have to be marshalled and unmarshalled below the application layer more than once when routing through the VPN tunnel?
Thank you,
Shabir
The two tests did not run under equal conditions: the first ran with a TCP window of 1.12 MByte, while the second, slower test ran with a window of only 0.085 MByte:
Client connecting to 172.17.0.4, TCP port 5001
TCP window size: 1.12 MByte (default)
^^^^
Client connecting to 10.23.0.2, TCP port 5001
TCP window size: 85.0 KByte (default)
^^^^
Thus, it's possible that you're experiencing TCP window exhaustion, both because of the smaller buffer and because of the increased latency through the VPN stack.
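You can test that hypothesis directly by forcing a larger window on both ends with iperf's -w option (the 4M value here is just a generous guess, not a measured recommendation):
iperf -s -w 4M
iperf -c 10.23.0.2 -w 4M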
In order to know what buffer size to use (if not just a huge buffer), you need to know your bandwidth-delay product.
I don't know what your original channel's RTT is, but we can take a stab at it. You were able to get ~30 Gbit/s over your link with a buffer size of 1.12 MByte, so running the bandwidth-delay product backwards (unit conversions elided):
1.12 MByte / 30 Gbit/s --> RTT ≈ 0.3 ms.
That seems reasonable. Now run the same arithmetic on the slow path, backwards from the throughput you measured: for an 85 KByte window to cap out at ~132 Mbit/s, the round-trip time through the tunnel would have to be roughly
0.085 MByte × 8 / 132 Mbit/s --> RTT ≈ 5 ms.
A few milliseconds of added round-trip time is quite plausible once every packet has to cross a userspace VPN stack plus encryption, so window exhaustion lines up with the performance you're seeing.
If, for example, you wanted to saturate a 100 Gbit/s pipe at an RTT of 0.6 ms, you would need a buffer size of 7.5 MByte. Alternatively, if you wanted to saturate the pipe not with a single connection but with N connections, you'd need N sockets, each with a send buffer of 7.5/N MByte.
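The arithmetic is easy to script; this little sketch (my own helper, not part of iperf) reproduces the numbers above:
# Bandwidth-delay product arithmetic; bdp_bytes is a hypothetical helper name.
def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    # Window size in bytes needed to keep a pipe of this bandwidth full.
    return bandwidth_bits_per_s * rtt_s / 8

print(bdp_bytes(100e9, 0.6e-3) / 1e6)  # 7.5  -> MBytes to saturate 100 Gbit/s at 0.6 ms
print(85_000 * 8 / 5e-3 / 1e6)         # 136.0 -> Mbit/s ceiling of an 85 KByte window at 5 ms RTT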

G1 GC processes references too slow

So, we're using G1 GC and an 18GB heap. The young generation size is about 3.5G, maximum heap usage is about 12G, and memory is full of short-lived objects.
It may also be important that a Couchbase instance is running on the same node as the JVM. It often grabs all the IOPS when persisting changes to the HDD, although there is plenty of free CPU time and memory.
Enabled JVM options:
-Xmx18g -Xms18g -XX:MaxPermSize=512M -XX:+UseG1GC -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled
My sad problem is young generation GC pauses. A long GC pause in the log usually looks like this:
2013-07-10T15:06:25.963+0400: 9122,066: [GC pause (young)
Desired survivor size 243269632 bytes, new threshold 5 (max 15)
- age 1: 69789280 bytes, 69789280 total
- age 2: 58618240 bytes, 128407520 total
- age 3: 54519720 bytes, 182927240 total
- age 4: 51592728 bytes, 234519968 total
- age 5: 45186520 bytes, 279706488 total
9122,066: [G1Ergonomics (CSet Construction) start choosing CSet, predicted base time: 174,16 ms, remaining time: 25,84 ms, target pause time: 200,00 ms]
9122,066: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 426 regions, survivors: 34 regions, predicted young region time: 164,97 ms]
9122,066: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 426 regions, survivors: 34 regions, old: 0 regions, predicted pause time: 339,13 ms, target pause time: 200,00 ms]
9122,259: [SoftReference, 0 refs, 0,0063780 secs]9124,575: [WeakReference, 4 refs, 0,0031600 secs]9124,578: [FinalReference, 1640 refs, 0,0033730 secs]9124,581: [PhantomReference, 145 refs, 0,0032080 secs]9124,585: [JNI Weak Reference, 0,0000810 secs], 2,53669600 secs]
[Parallel Time: 190,5 ms]
[GC Worker Start (ms): 9122066,6 9122066,7 9122066,7 9122066,7 9122066,8 9122066,9 9122066,9 9122066,9 9122066,9 9122067,0 9122067,0 9122067,1 9122067,1 9122067,1 9122067,1 9122067,2 9122067,2 9122067,3
Avg: 9122067,0, Min: 9122066,6, Max: 9122067,3, Diff: 0,7]
[Ext Root Scanning (ms): 4,7 6,0 4,8 4,5 4,2 4,3 4,2 4,3 4,6 3,4 13,5 5,2 4,2 5,6 4,2 4,1 4,3 4,0
Avg: 5,0, Min: 3,4, Max: 13,5, Diff: 10,1]
[Update RS (ms): 20,9 19,6 21,1 21,3 21,2 21,2 21,3 21,2 21,7 21,5 12,1 20,2 21,1 19,4 21,0 21,1 20,7 21,2
Avg: 20,4, Min: 12,1, Max: 21,7, Diff: 9,6]
[Processed Buffers : 27 23 25 29 31 22 25 34 28 14 36 23 24 22 28 24 25 24
Sum: 464, Avg: 25, Min: 14, Max: 36, Diff: 22]
[Scan RS (ms): 9,0 9,2 8,7 8,8 9,1 9,1 8,9 9,1 8,3 9,2 9,0 9,1 9,2 9,2 9,1 9,0 9,0 9,1
Avg: 9,0, Min: 8,3, Max: 9,2, Diff: 1,0]
[Object Copy (ms): 145,1 145,0 145,2 145,1 145,1 144,9 145,1 144,9 144,9 145,4 144,8 144,8 144,8 145,0 145,0 145,1 145,2 144,9
Avg: 145,0, Min: 144,8, Max: 145,4, Diff: 0,6]
[Termination (ms): 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0
Avg: 0,0, Min: 0,0, Max: 0,0, Diff: 0,0]
[Termination Attempts : 5 8 2 11 5 6 6 5 5 7 4 7 2 9 8 5 7 8
Sum: 110, Avg: 6, Min: 2, Max: 11, Diff: 9]
[GC Worker End (ms): 9122246,4 9122246,6 9122246,7 9122246,6 9122246,7 9122246,7 9122246,5 9122246,7 9122246,5 9122246,5 9122246,6 9122246,7 9122246,8 9122246,4 9122246,6 9122246,5 9122246,7 9122246,8
Avg: 9122246,6, Min: 9122246,4, Max: 9122246,8, Diff: 0,3]
[GC Worker (ms): 179,8 179,9 180,0 179,8 179,9 179,9 179,6 179,8 179,6 179,5 179,6 179,6 179,7 179,3 179,5 179,4 179,4 179,5
Avg: 179,7, Min: 179,3, Max: 180,0, Diff: 0,7]
[GC Worker Other (ms): 10,7 10,7 10,8 10,8 10,9 10,9 11,0 11,0 11,0 11,1 11,1 11,1 11,2 11,2 11,2 11,2 11,3 11,3
Avg: 11,0, Min: 10,7, Max: 11,3, Diff: 0,6]
[Clear CT: 2,8 ms]
[Other: 2343,4 ms]
[Choose CSet: 0,1 ms]
[Ref Proc: 2327,7 ms]
[Ref Enq: 1,9 ms]
[Free CSet: 8,2 ms]
[Eden: 3408M(3408M)->0B(3400M) Survivors: 272M->280M Heap: 9998M(18432M)->6638M(18432M)]
[Times: user=3,26 sys=0,02, real=2,54 secs]
Total time for which application threads were stopped: 2,5434370 seconds
The only GC phase that causes problems is reference processing. But the log looks strange: soft, weak, final and JNI reference processing each took very little time, yet the overall Ref Proc time is 2.5 seconds. It can be even longer, up to 10 seconds in the worst cases.
Another pause (more comfortable) may look like:
2013-07-10T16:26:11.862+0400: 13907,965: [GC pause (young)
Desired survivor size 243269632 bytes, new threshold 4 (max 15)
- age 1: 69125832 bytes, 69125832 total
- age 2: 58756480 bytes, 127882312 total
- age 3: 52397376 bytes, 180279688 total
- age 4: 88850424 bytes, 269130112 total
13907,965: [G1Ergonomics (CSet Construction) start choosing CSet, predicted base time: 77,38 ms, remaining time: 122,62 ms, target pause time: 200,00 ms]
13907,965: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 427 regions, survivors: 33 regions, predicted young region time: 167,95 ms]
13907,965: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 427 regions, survivors: 33 regions, old: 0 regions, predicted pause time: 245,33 ms, target pause time: 200,00 ms]
13908,155: [SoftReference, 0 refs, 0,0041340 secs]13908,160: [WeakReference, 0 refs, 0,0023850 secs]13908,162: [FinalReference, 1393 refs, 0,0065970 secs]13908,169: [PhantomReference, 108 refs, 0,0018650 secs]13908,171: [JNI Weak Reference, 0,0000630 secs], 0,22008100 secs]
[Parallel Time: 188,4 ms]
[GC Worker Start (ms): 13907965,3 13907965,3 13907965,4 13907965,4 13907965,5 13907965,5 13907965,6 13907965,6 13907965,6 13907965,7 13907965,7 13907965,7 13907965,8 13907965,8 13907965,8 13907965,9 13907965,9 13907965,9
Avg: 13907965,6, Min: 13907965,3, Max: 13907965,9, Diff: 0,6]
[Ext Root Scanning (ms): 5,8 5,0 6,8 6,3 6,1 6,2 6,0 6,3 5,2 4,2 5,0 6,2 4,5 6,0 17,1 4,4 6,2 5,3
Avg: 6,3, Min: 4,2, Max: 17,1, Diff: 12,9]
[Update RS (ms): 24,8 26,0 23,9 24,1 24,1 24,1 24,2 23,9 25,0 25,2 25,1 24,1 26,0 24,3 13,7 25,7 24,2 24,7
Avg: 24,1, Min: 13,7, Max: 26,0, Diff: 12,2]
[Processed Buffers : 30 20 9 16 16 19 20 21 22 12 30 17 17 20 12 20 17 22
Sum: 340, Avg: 18, Min: 9, Max: 30, Diff: 21]
[Scan RS (ms): 7,5 7,1 7,2 7,5 7,6 7,5 7,5 7,6 7,1 7,4 7,6 7,2 7,2 7,4 7,2 7,5 7,0 7,7
Avg: 7,4, Min: 7,0, Max: 7,7, Diff: 0,7]
[Object Copy (ms): 133,1 133,1 133,2 133,1 133,2 133,1 133,2 133,1 133,5 134,0 133,0 133,2 133,0 132,9 132,6 133,1 133,2 132,9
Avg: 133,1, Min: 132,6, Max: 134,0, Diff: 1,3]
[Termination (ms): 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0
Avg: 0,0, Min: 0,0, Max: 0,0, Diff: 0,0]
[Termination Attempts : 1 3 1 1 1 1 1 1 1 1 2 3 2 1 1 1 1 1
Sum: 24, Avg: 1, Min: 1, Max: 3, Diff: 2]
[GC Worker End (ms): 13908136,6 13908136,9 13908136,5 13908136,7 13908136,7 13908136,8 13908136,7 13908136,7 13908136,8 13908136,8 13908136,5 13908136,6 13908136,5 13908136,5 13908136,5 13908136,5 13908136,8 13908136,6
Avg: 13908136,7, Min: 13908136,5, Max: 13908136,9, Diff: 0,4]
[GC Worker (ms): 171,3 171,6 171,1 171,2 171,2 171,3 171,1 171,1 171,1 171,2 170,8 170,9 170,7 170,7 170,7 170,6 171,0 170,7
Avg: 171,0, Min: 170,6, Max: 171,6, Diff: 0,9]
[GC Worker Other (ms): 17,2 17,2 17,3 17,3 17,4 17,4 17,5 17,5 17,5 17,5 17,6 17,6 17,7 17,7 17,7 17,7 17,8 17,8
Avg: 17,5, Min: 17,2, Max: 17,8, Diff: 0,6]
[Clear CT: 1,6 ms]
[Other: 30,1 ms]
[Choose CSet: 0,1 ms]
[Ref Proc: 17,1 ms]
[Ref Enq: 0,9 ms]
[Free CSet: 7,4 ms]
[Eden: 3416M(3416M)->0B(3456M) Survivors: 264M->224M Heap: 7289M(18432M)->3912M(18432M)]
[Times: user=3,16 sys=0,00, real=0,22 secs]
Reference processing is still the longest phase, but it's much shorter. ParallelRefProcEnabled was not a cure for my problem. I've also tried changing the size of the young generation; that didn't help either. Setting different -XX:MaxGCPauseMillis values, whether a more relaxed 600 ms or a stricter 100 ms, still results in bad throughput.
CMS performance is even worse than G1's, with these parameters:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
Young gen collections run longer and more often.
I'm totally confused by the logs described above. Tomorrow I'll try to move the Couchbase instance to another node to check whether it is freezing the GC threads.
But if Couchbase is not the culprit, maybe someone could explain the meaning of the logs to me, or maybe there are some magic CMS parameters to fix this.
I'd be very glad for any help!
We fixed the problem ourselves, and we now have a new rule: NEVER install Couchbase next to a JVM.
In the past we had a similar problem with a PostgreSQL instance conflicting with Couchbase: Couchbase likes to grab all the disk ops, and PostgreSQL could not commit anything.
So isolate Couchbase, and everything will be all right.
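A postscript for anyone hitting the same wall: in the long pause above, user=3,26 sys=0,02, real=2,54, while the 18 parallel workers account for roughly 190 ms each, so over two seconds of real time show up in no CPU counter at all. That is the usual sign of GC threads being blocked (e.g. on disk) rather than computing. These standard HotSpot logging flags help make such stalls visible; the combination is only a suggestion, and the log path is illustrative:
-XX:+PrintGCApplicationStoppedTime    # log every stop-the-world pause, not just GC work
-XX:+PrintGCDateStamps                # wall-clock timestamps, to correlate with iostat
-Xloggc:/path/on/a/quiet/disk/gc.log  # keep GC logging itself off the contended disk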

Why is it faster to sum a Data.Sequence by divide-and-conquer, even with no parallelism?

I was playing with parallel reduction on a Data.Sequence.Seq, and I noticed that divide-and-conquer gives a speed advantage even without parallelism. Does anyone know why?
Here's my code:
import qualified Data.Sequence as S
import qualified Data.Foldable as F
import System.Random
import Control.DeepSeq
import Criterion.Main
import Test.QuickCheck
import Control.Exception ( evaluate )
instance (Arbitrary a) => Arbitrary (S.Seq a) where
  arbitrary = fmap S.fromList arbitrary
instance NFData a => NFData (S.Seq a) where
  rnf = F.foldr seq ()
funs :: [(String, S.Seq Int -> Int)]
funs =
  [ ("seqDirect"  , seqDirect)
  , ("seqFoldr"   , seqFoldr)
  , ("seqFoldl'"  , seqFoldl')
  , ("seqSplit 1" , (seqSplit 1))
  , ("seqSplit 2" , (seqSplit 2))
  , ("seqSplit 4" , (seqSplit 4))
  , ("seqSplit 8" , (seqSplit 8))
  , ("seqSplit 16", (seqSplit 16))
  , ("seqSplit 32", (seqSplit 32)) ]
main :: IO ()
main = do
  mapM_ (\(_,f) -> quickCheck (\xs -> seqDirect xs == f xs)) funs
  gen <- newStdGen
  let inpt = S.fromList . take 100000 $ randoms gen
  evaluate (rnf inpt)
  defaultMain [ bench n (nf f inpt) | (n,f) <- funs ]
seqDirect :: S.Seq Int -> Int
seqDirect v = case S.viewl v of
  S.EmptyL  -> 0
  x S.:< xs -> x + seqDirect xs
seqFoldr :: S.Seq Int -> Int
seqFoldr = F.foldr (+) 0
seqFoldl' :: S.Seq Int -> Int
seqFoldl' = F.foldl' (+) 0
seqSplit :: Int -> S.Seq Int -> Int
seqSplit 1 xs = seqFoldr xs
seqSplit _ xs | S.null xs = 0
seqSplit n xs =
  let (a, b) = S.splitAt (S.length xs `div` n) xs
      sa = seqFoldr a
      sb = seqSplit (n-1) b
  in sa + sb
And the results:
$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.0.4
$ ghc --make -O2 -fforce-recomp -rtsopts seq.hs
[1 of 1] Compiling Main ( seq.hs, seq.o )
Linking seq ...
$ ./seq +RTS -s
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
warming up
estimating clock resolution...
mean is 5.882556 us (160001 iterations)
found 2368 outliers among 159999 samples (1.5%)
2185 (1.4%) high severe
estimating cost of a clock call...
mean is 85.26448 ns (44 iterations)
found 4 outliers among 44 samples (9.1%)
3 (6.8%) high mild
1 (2.3%) high severe
benchmarking seqDirect
mean: 23.37511 ms, lb 23.01101 ms, ub 23.77594 ms, ci 0.950
std dev: 1.953348 ms, lb 1.781578 ms, ub 2.100916 ms, ci 0.950
benchmarking seqFoldr
mean: 25.60206 ms, lb 25.39648 ms, ub 25.80034 ms, ci 0.950
std dev: 1.030794 ms, lb 926.7246 us, ub 1.156656 ms, ci 0.950
benchmarking seqFoldl'
mean: 10.65757 ms, lb 10.29087 ms, ub 10.99869 ms, ci 0.950
std dev: 1.819595 ms, lb 1.703732 ms, ub 1.922018 ms, ci 0.950
benchmarking seqSplit 1
mean: 25.50376 ms, lb 25.29045 ms, ub 25.71225 ms, ci 0.950
std dev: 1.075497 ms, lb 961.5707 us, ub 1.229739 ms, ci 0.950
benchmarking seqSplit 2
mean: 18.15032 ms, lb 17.62943 ms, ub 18.66413 ms, ci 0.950
std dev: 2.652232 ms, lb 2.288088 ms, ub 3.044585 ms, ci 0.950
benchmarking seqSplit 4
mean: 10.48334 ms, lb 10.14152 ms, ub 10.87061 ms, ci 0.950
std dev: 1.869274 ms, lb 1.690063 ms, ub 1.997915 ms, ci 0.950
benchmarking seqSplit 8
mean: 5.737956 ms, lb 5.616747 ms, ub 5.965689 ms, ci 0.950
std dev: 825.2361 us, lb 442.1652 us, ub 1.232003 ms, ci 0.950
benchmarking seqSplit 16
mean: 3.677038 ms, lb 3.669035 ms, ub 3.685547 ms, ci 0.950
std dev: 42.18741 us, lb 36.57112 us, ub 49.93574 us, ci 0.950
benchmarking seqSplit 32
mean: 2.855626 ms, lb 2.849962 ms, ub 2.862226 ms, ci 0.950
std dev: 31.25475 us, lb 26.49104 us, ub 37.18611 us, ci 0.950
25,154,069,064 bytes allocated in the heap
4,120,506,464 bytes copied during GC
32,344,120 bytes maximum residency (446 sample(s))
4,042,704 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 42092 collections, 0 parallel, 6.57s, 6.57s elapsed
Generation 1: 446 collections, 0 parallel, 2.62s, 2.62s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 18.57s ( 18.58s elapsed)
GC time 9.19s ( 9.19s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 27.76s ( 27.77s elapsed)
%GC time 33.1% (33.1% elapsed)
Alloc rate 1,354,367,579 bytes per MUT second
Productivity 66.9% of total user, 66.9% of total elapsed
Note: this answer doesn't actually answer the question; it only restates it in a different way. The precise reason why Data.Sequence.foldr slows down as the sequence gets bigger is still unknown.
Your code
seqFoldr :: S.Seq Int -> Int
seqFoldr = F.foldr (+) 0
has non-linear performance depending on the length of the sequence. Take a look at this benchmark:
./seq-customized +RTS -s -A128M
[Length] [Performance of function seqFoldr]
25000: mean: 1.096352 ms, lb 1.083301 ms, ub 1.121152 ms, ci 0.950
50000: mean: 2.542133 ms, lb 2.514076 ms, ub 2.583209 ms, ci 0.950
100000: mean: 6.068437 ms, lb 5.951889 ms, ub 6.237442 ms, ci 0.950
200000: mean: 14.41332 ms, lb 13.95552 ms, ub 15.21217 ms, ci 0.950
Using the line with 25000 as a base gives us the following table:
[Length] [Performance of function seqFoldr]
1x: mean: 1.00 = 1*1.00
2x: mean: 2.32 = 2*1.16
4x: mean: 5.54 = 4*1.39
8x: mean: 13.15 = 8*1.64
In the above table, the non-linearity is demonstrated by the series 1.00, 1.16, 1.39, 1.64.
See also http://haskell.org/haskellwiki/Performance#Data.Sequence_vs._lists
Assuming the initial length of Seq xs is 100000 and n is 32, your code
seqSplit n xs =
  let (a, b) = S.splitAt (S.length xs `div` n) xs
      sa = seqFoldr a
      sb = seqSplit (n-1) b
  in sa + sb
will be passing somewhat shorter Seqs to the function seqFoldr. The successive lengths of the Seqs passed from the above code to function seqFoldr look like:
(length xs)/n = (length a)
--------------------------
100000/32 = 3125
(100000-3125)/31 = 3125
(100000-2*3125)/30 = 3125
...
(100000-30*3125)/2 = 3125
Based on the first part of my answer (where we saw that the performance is non-linear), 32 calls to seqFoldr on Seqs of length 3125 will execute faster than 1 call to seqFoldr on a single Seq of length 32*3125 = 100000.
Thus, the answer to your question is: because foldr on Data.Sequence gets slower as the sequence gets larger.
Try using foldr' instead of foldr. I bet it is the lazy behavior of foldr, which allocates a thunk for each element of the sequence and only evaluates them at the end.
Edit:
Using foldr' roughly halves the time in my case, but it is still slower than foldl'. That suggests there is some complexity issue in the Data.Sequence.fold* implementations.
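For reference, the strict variant benchmarked below is presumably just this (my reconstruction, using foldr' from Data.Foldable):
seqFoldr' :: S.Seq Int -> Int
seqFoldr' = F.foldr' (+) 0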
benchmarking seqFoldr
collecting 100 samples, 1 iterations each, in estimated 2.516484 s
bootstrapping with 100000 resamples
mean: 24.93222 ms, lb 24.72772 ms, ub 25.15255 ms, ci 0.950
std dev: 1.081204 ms, lb 938.4503 us, ub 1.332666 ms, ci 0.950
found 1 outliers among 100 samples (1.0%)
variance introduced by outliers: 0.999%
variance is unaffected by outliers
benchmarking seqFoldr'
collecting 100 samples, 1 iterations each, in estimated 902.7004 ms
bootstrapping with 100000 resamples
mean: 11.05375 ms, lb 10.68481 ms, ub 11.42519 ms, ci 0.950
std dev: 1.895777 ms, lb 1.685334 ms, ub 2.410870 ms, ci 0.950
found 1 outliers among 100 samples (1.0%)
variance introduced by outliers: 1.000%
variance is unaffected by outliers
benchmarking seqFoldl'
collecting 100 samples, 1 iterations each, in estimated 862.4077 ms
bootstrapping with 100000 resamples
mean: 10.35651 ms, lb 9.947395 ms, ub 10.73637 ms, ci 0.950
std dev: 2.011693 ms, lb 1.875869 ms, ub 2.131425 ms, ci 0.950
variance introduced by outliers: 1.000%
variance is unaffected by outliers
