To begin with, we had an Aerospike cluster of 5 i2.2xlarge nodes in AWS, which our production fleet of around 200 servers was using to store/retrieve data. The Aerospike config of the cluster was as follows -
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 8
    transaction-queues 8
    transaction-threads-per-queue 4
    fabric-workers 8
    transaction-pending-limit 100
    proto-fd-max 25000
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002 # Heartbeat port for this node.
        # List one or more other nodes, one ip-address & port per line:
        mesh-seed-address-port <IP> 3002
        mesh-seed-address-port <IP> 3002
        mesh-seed-address-port <IP> 3002
        mesh-seed-address-port <IP> 3002
        # mesh-seed-address-port <IP> 3002
        interval 250
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
namespace FC {
    replication-factor 2
    memory-size 7G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    high-water-disk-pct 80 # How full may the disk become before the server begins eviction.
    high-water-memory-pct 70 # Evict non-zero TTL data if capacity exceeds 70% of memory-size.
    stop-writes-pct 90 # Stop writes if capacity exceeds 90% of memory-size.
    storage-engine device {
        device /dev/xvdb1
        write-block-size 256K
    }
}
It was handling the traffic for the "FC" namespace properly, with latencies within 14 ms, as shown in the following graph plotted using Graphite -
However, on turning on another namespace with much higher traffic on the same cluster, it started giving a lot of timeouts and higher latencies as we scaled up the number of servers using the same 5-node cluster (increasing the number of servers step by step from 20 to 40 to 60), with the following namespace configuration -
namespace HEAVYNAMESPACE {
    replication-factor 2
    memory-size 35G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    high-water-disk-pct 80 # How full may the disk become before the server begins eviction.
    high-water-memory-pct 70 # Evict non-zero TTL data if capacity exceeds 70% of 35GB.
    stop-writes-pct 90 # Stop writes if capacity exceeds 90% of 35GB.
    storage-engine device {
        device /dev/xvdb8
        write-block-size 256K
    }
}
The observations were as follows -
----FC Namespace----
20 - servers, 6k Write TPS, 16K Read TPS
set latency = 10ms
set timeouts = 1
get latency = 15ms
get timeouts = 3
40 - servers, 12k Write TPS, 17K Read TPS
set latency = 12ms
set timeouts = 1
get latency = 20ms
get timeouts = 5
60 - servers, 17k Write TPS, 18K Read TPS
set latency = 25ms
set timeouts = 5
get latency = 30ms
get timeouts = 10-50 (fluctuating)
----HEAVYNAMESPACE----
20 - del servers, 6k Write TPS, 16K Read TPS
set latency = 7ms
set timeouts = 1
get latency = 5ms
get timeouts = 0
no of keys = 47 million x 2
disk usage = 121 gb
ram usage = 5.62 gb
40 - del servers, 12k Write TPS, 17K Read TPS
set latency = 15ms
set timeouts = 5
get latency = 12ms
get timeouts = 2
60 - del servers, 17k Write TPS, 18K Read TPS
set latency = 25ms
set timeouts = 25-75 (fluctuating)
get latency = 25ms
get timeouts = 2-15 (fluctuating)
* Set latency refers to the latency in setting Aerospike cache keys; get latency, similarly, refers to getting keys.
We had to turn off the namespace "HEAVYNAMESPACE" after reaching 60 servers.
We then started a fresh POC with a cluster whose nodes were r3.4xlarge AWS instances (find details here: https://aws.amazon.com/ec2/instance-types/). The key difference in the Aerospike configuration was using memory only for caching, hoping that it would give better performance. Here is the aerospike.conf file -
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 16
    transaction-queues 16
    transaction-threads-per-queue 4
    proto-fd-max 15000
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002 # Heartbeat port for this node.
        # List one or more other nodes, one ip-address & port per line:
        mesh-seed-address-port <IP> 3002
        mesh-seed-address-port <IP> 3002
        mesh-seed-address-port <IP> 3002
        mesh-seed-address-port <IP> 3002
        mesh-seed-address-port <IP> 3002
        interval 250
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
namespace FC {
    replication-factor 2
    memory-size 30G
    storage-engine memory
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    high-water-memory-pct 80 # Evict non-zero TTL data if capacity exceeds 80% of memory-size.
    stop-writes-pct 90 # Stop writes if capacity exceeds 90% of memory-size.
}
We began with the FC namespace only, and decided to go ahead with HEAVYNAMESPACE only if we saw significant improvements with FC, but we didn't see any. Here are the current observations with different combinations of node count and server count -
Current stats
Observation Point 1 - 4 nodes serving 130 servers.
Point 2 - 5 nodes serving 80 servers.
Point 3 - 5 nodes serving 100 servers.
These observation points are highlighted in the graphs below -
Get latency -
Set successes (giving a measure of the load handled by the cluster) -
We also observed that -
Total memory usage across cluster is 5.52 GB of 144 GB. Node-wise memory usage is ~ 1.10 GB out of 28.90 GB.
There were no observed write failures yet.
There were occasional get/set timeouts which looked fine.
No evicted objects.
Conclusion
We are not seeing the improvements we had expected from the memory-only configuration. We would like some pointers on how to scale up at the same cost -
- by tweaking the Aerospike configuration
- or by using a more suitable AWS instance type (even if that leads to cost cutting).
Update
Output of the top command on one of the Aerospike servers, to show SI (pointed out by @Sunil in his answer) -
$ top
top - 08:02:21 up 188 days, 48 min, 1 user, load average: 0.07, 0.07, 0.02
Tasks: 179 total, 1 running, 178 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.1%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 125904196k total, 2726964k used, 123177232k free, 148612k buffers
Swap: 0k total, 0k used, 0k free, 445968k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
63421 root 20 0 5217m 1.6g 4340 S 6.3 1.3 461:08.83 asd
If I am not wrong, the SI appears to be 0.2%. I checked the same on all the nodes of the cluster; it is 0.2% on one and 0.1% on the rest.
Also, here is the output of the network stats on the same node -
$ sar -n DEV 10 10
Linux 4.4.30-32.54.amzn1.x86_64 (ip-10-111-215-72) 07/10/17 _x86_64_ (16 CPU)
08:09:16 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:09:26 lo 12.20 12.20 5.61 5.61 0.00 0.00 0.00 0.00
08:09:26 eth0 2763.60 1471.60 299.24 233.08 0.00 0.00 0.00 0.00
08:09:26 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:09:36 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:09:36 eth0 2772.60 1474.50 300.08 233.48 0.00 0.00 0.00 0.00
08:09:36 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:09:46 lo 17.90 17.90 15.21 15.21 0.00 0.00 0.00 0.00
08:09:46 eth0 2802.80 1491.90 304.63 245.33 0.00 0.00 0.00 0.00
08:09:46 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:09:56 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:09:56 eth0 2805.20 1494.30 304.37 237.51 0.00 0.00 0.00 0.00
08:09:56 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:06 lo 9.40 9.40 5.05 5.05 0.00 0.00 0.00 0.00
08:10:06 eth0 3144.10 1702.30 342.54 255.34 0.00 0.00 0.00 0.00
08:10:06 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:16 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:10:16 eth0 2862.70 1522.20 310.15 238.32 0.00 0.00 0.00 0.00
08:10:16 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:26 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:10:26 eth0 2738.40 1453.80 295.85 231.47 0.00 0.00 0.00 0.00
08:10:26 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:36 lo 11.79 11.79 5.59 5.59 0.00 0.00 0.00 0.00
08:10:36 eth0 2758.14 1464.14 297.59 231.47 0.00 0.00 0.00 0.00
08:10:36 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:46 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:10:46 eth0 3100.40 1811.30 328.31 289.92 0.00 0.00 0.00 0.00
08:10:46 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:56 lo 9.40 9.40 5.05 5.05 0.00 0.00 0.00 0.00
08:10:56 eth0 2753.40 1460.80 297.15 231.98 0.00 0.00 0.00 0.00
Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: lo 12.07 12.07 6.45 6.45 0.00 0.00 0.00 0.00
Average: eth0 2850.12 1534.68 307.99 242.79 0.00 0.00 0.00 0.00
From the above, I think the total number of packets handled per second is 2850.12 + 1534.68 = 4384.8 (the sum of rxpck/s and txpck/s), which is well within the 250K packets per second mentioned in the Amazon EC2 deployment guide on the Aerospike site, referred to in @RonenBotzer's answer.
Update 2
I ran the asadm command followed by show latency on one of the nodes of the cluster, and from the output it appears that there is no latency beyond 1 ms for either reads or writes -
Admin> show latency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~read Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
ip-10-111-215-72.ec2.internal:3000 11:35:01->11:35:11 1242.1 0.0 0.0 0.0
ip-10-13-215-20.ec2.internal:3000 11:34:57->11:35:07 1297.5 0.0 0.0 0.0
ip-10-150-147-167.ec2.internal:3000 11:35:04->11:35:14 1147.7 0.0 0.0 0.0
ip-10-165-168-246.ec2.internal:3000 11:34:59->11:35:09 1342.2 0.0 0.0 0.0
ip-10-233-158-213.ec2.internal:3000 11:35:00->11:35:10 1218.0 0.0 0.0 0.0
Number of rows: 5
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~write Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
ip-10-111-215-72.ec2.internal:3000 11:35:01->11:35:11 33.0 0.0 0.0 0.0
ip-10-13-215-20.ec2.internal:3000 11:34:57->11:35:07 37.2 0.0 0.0 0.0
ip-10-150-147-167.ec2.internal:3000 11:35:04->11:35:14 36.4 0.0 0.0 0.0
ip-10-165-168-246.ec2.internal:3000 11:34:59->11:35:09 36.9 0.0 0.0 0.0
ip-10-233-158-213.ec2.internal:3000 11:35:00->11:35:10 33.9 0.0 0.0 0.0
Number of rows: 5
Aerospike has several modes for storage that you can configure:
Data in memory with no persistence
Data in memory, persisted to disk
Data on SSD, primary index in memory (AKA Hybrid Memory architecture)
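For reference, here is a minimal sketch of how these three modes map onto namespace stanzas in aerospike.conf. The namespace names, sizes, and device/file paths below are placeholders for illustration, not values from your post:
namespace cache_only {
    memory-size 4G
    storage-engine memory # data in memory, no persistence
}
namespace mem_plus_disk {
    memory-size 4G
    storage-engine device {
        file /opt/aerospike/data/mem_plus_disk.dat # data in memory, persisted to a file on disk
        filesize 16G
        data-in-memory true
    }
}
namespace hybrid {
    memory-size 4G
    storage-engine device {
        device /dev/xvdb1 # data on SSD, primary index in memory
        write-block-size 128K
    }
}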
In-Memory Optimizations
Release 3.11 and release 3.12 of Aerospike include several big performance improvements for in-memory namespaces.
Among these is a change to how partitions are represented, from a single red-black tree to sprigs (many sub-trees). The new config parameters partition-tree-sprigs and partition-tree-locks should be used appropriately. In your case, as r3.4xlarge instances have 122G of DRAM, you can afford the 311M of overhead associated with setting partition-tree-sprigs to the max value of 4096.
You should also consider the auto-pin=cpu setting. This option requires Linux kernel >= 3.19, which is part of Ubuntu >= 15.04 (but not many other distributions yet).
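In aerospike.conf those settings would look roughly like this. This is only a sketch: the auto-pin directive in the service context and the 4096 sprigs value follow the advice above, while the partition-tree-locks value is just an illustrative choice to tune for your workload:
service {
    # ... existing service settings ...
    auto-pin cpu # needs Linux kernel >= 3.19
}
namespace FC {
    # ... existing namespace settings ...
    partition-tree-sprigs 4096 # ~311M of overhead on a 122G r3.4xlarge, as noted above
    partition-tree-locks 256 # illustrative value; tune per the Aerospike documentation
}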
Clustering Improvements
The recent releases 3.13 and 3.14 include a rewrite of the cluster manager. In general you should consider using the latest version, but I'm pointing out the aspects that will directly affect your performance.
EC2 Networking and Aerospike
You don't show the latency numbers of the cluster itself, so I suspect the problem is with the networking, rather than the nodes.
Older instance family types, such as the r3, c3, i2, come with ENIs - NICs which have a single transmit/receive queue. The software interrupts of cores accessing this queue may become a bottleneck as the number of CPUs increases, all of which need to wait for their turn to use the NIC. There's a knowledge base article in the Aerospike community discussion forum on using multiple ENIs with Aerospike to get around the limited performance capacity of the single ENI you initially get with such an instance. The Amazon EC2 deployment guide on the Aerospike site talks about using RPS to maximize TPS when you're in an instance that uses ENIs.
Alternatively, you should consider moving to the newer instances (r4, i3, etc) which come with multiqueue ENAs. These do not require RPS, and support higher TPS without adding extra cards. They also happen to have better chipsets, and cost significantly less than their older siblings (r4 is roughly 30% cheaper than r3, i3 is about 1/3 the price of the i2).
Your title is misleading. Please consider changing it. You moved from on-disk to in-memory.
mem+disk means data is both on disk and mem (using data-in-memory=true).
My best guess is that one CPU is bottlenecking on network I/O.
You can take a look at the top output and check the si (software interrupts) column.
If one CPU is showing a much higher value than the others, the simplest thing you can try is RPS (Receive Packet Steering):
echo f | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
Once you confirm that it is a network bottleneck, you can try ENA as suggested by @Ronen.
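For example, to see whether the software interrupts are piling up on a single core (before and after enabling RPS), something like the following should work; this is just a sketch, and mpstat comes from the sysstat package:
# per-CPU utilisation, including the %soft (software interrupt) column, sampled every second for 5 seconds
mpstat -P ALL 1 5
# raw softirq counters per CPU; watch whether NET_RX keeps growing on only one core
watch -d -n1 'grep -E "CPU|NET_RX" /proc/softirqs'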
Going into details: when you had ~15 ms latency with only FC, I assume the TPS was low. But when you added the heavy load of HEAVYNAMESPACE in production, the latency kept increasing as you added more client nodes and hence more TPS. Similarly, in your POC the latency increased with the number of client nodes. That said, the latency is under 15 ms even with 130 servers, which is partly good.
I am not sure I understood your set_success graph; I am assuming it is in kTPS.
Update:
After looking at the server-side latency histogram, it looks like the server is doing fine.
Most likely it is a client issue. Check CPU and network on the client machine(s).
Indexing large arrays seems to be taking FAR longer in Julia 0.5 and 0.6 than in 0.4.7.
For instance:
x = rand(10,10,100,4,4,1000) #Dummy array
tic()
r = squeeze(mean(x[:,:,1:80,:,:,56:800],(1,2,3,4,5)),(1,2,3,4,5))
toc()
Julia 0.5.0 -> elapsed time: 176.357068283 seconds
Julia 0.4.7 -> elapsed time: 1.19991952 seconds
Edit: as requested, I've updated the benchmark to use BenchmarkTools.jl and wrapped the code in a function:
using BenchmarkTools
function testf(x)
    r = squeeze(mean(x[:,:,1:80,:,:,56:800],(1,2,3,4,5)),(1,2,3,4,5));
end
x = rand(10,10,100,4,4,1000) # Dummy array
@benchmark testf(x)
In 0.5.0 I get the following (with huge memory usage):
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 23.36 gb
allocs estimate: 1043200022
minimum time: 177.94 s (1.34% GC)
median time: 177.94 s (1.34% GC)
mean time: 177.94 s (1.34% GC)
maximum time: 177.94 s (1.34% GC)
In 0.4.7 I get:
BenchmarkTools.Trial:
samples: 11
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 727.55 mb
allocs estimate: 79
minimum time: 425.82 ms (0.06% GC)
median time: 485.95 ms (11.31% GC)
mean time: 482.67 ms (10.37% GC)
maximum time: 503.27 ms (11.22% GC)
Edit: Updated to use sub in 0.4.7 and view in 0.5.0
using BenchmarkTools
function testf(x)
    r = mean(sub(x, :, :, 1:80, :, :, 56:800));
end
x = rand(10,10,100,4,4,1000) # Dummy array
@benchmark testf(x)
In 0.5.0 it ran for >20 mins and gave:
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 53.75 gb
allocs estimate: 2271872022
minimum time: 407.64 s (1.32% GC)
median time: 407.64 s (1.32% GC)
mean time: 407.64 s (1.32% GC)
maximum time: 407.64 s (1.32% GC)
In 0.4.7 I get:
BenchmarkTools.Trial:
samples: 5
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.28 kb
allocs estimate: 34
minimum time: 1.15 s (0.00% GC)
median time: 1.16 s (0.00% GC)
mean time: 1.16 s (0.00% GC)
maximum time: 1.18 s (0.00% GC)
This seems repeatable on other machines, so an issue has been opened: https://github.com/JuliaLang/julia/issues/19174
EDIT 17 March 2017: This regression is fixed in Julia v0.6.0. The discussion below still applies to older versions of Julia.
Try running this crude script in both Julia v0.4.7 and v0.5.0 (change sub to view):
using BenchmarkTools
function testf()
    # set seed
    srand(2016)
    # test array
    x = rand(10,10,100,4,4,1000)
    # extract array view
    y = sub(x, :, :, 1:80, :, :, 56:800)   # julia v0.4
    #y = view(x, :, :, 1:80, :, :, 56:800) # julia v0.5
    # wrap mean(y) into a function
    z() = mean(y)
    # benchmark array mean
    @time z()
    @time z()
end
testf()
My machine:
julia> versioninfo()
Julia Version 0.4.7
Commit ae26b25 (2016-09-18 16:17 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
My output, Julia v0.4.7:
1.314966 seconds (246.43 k allocations: 11.589 MB)
1.017073 seconds (1 allocation: 16 bytes)
My output, Julia v0.5.0:
417.608056 seconds (2.27 G allocations: 53.749 GB, 0.75% gc time)
410.918933 seconds (2.27 G allocations: 53.747 GB, 0.72% gc time)
It would seem that you may have discovered a performance regression. Consider filing an issue.
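If you are stuck on a version affected by the regression, one workaround is to skip the view/reduction machinery entirely and accumulate the mean with explicit loops over the plain array. This is only a sketch, with the ranges hard-coded to the example above and region_mean being a made-up name:
# Loop-based workaround for affected Julia versions: indexing the plain Array
# directly avoids the slow view/reduction path and the large temporary allocations.
function region_mean(x)
    s = 0.0
    n = 0
    for i6 in 56:800, i5 in 1:4, i4 in 1:4, i3 in 1:80, i2 in 1:10, i1 in 1:10
        s += x[i1, i2, i3, i4, i5, i6] # i1 is the innermost (fastest-varying) index
        n += 1
    end
    return s / n
end
x = rand(10,10,100,4,4,1000)
@time region_mean(x)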
I run gfortran 4.9.2 on a 64-bit Windows 7 machine with an Intel Core i5-4570 (Haswell). I compile and execute on this same machine.
Compiling my code (scientific simulation) with
gfortran -frecord-marker-4 -fno-automatic -O3 -fdefault-real-8 (...)
-Wline-truncation -Wsurprising -ffpe-trap=invalid,zero,overflow (...)
-march=core2 -mfpmath=sse -c
is ~30% FASTER than compiling with
gfortran -frecord-marker-4 -fno-automatic -O3 -fdefault-real-8 (...)
-Wline-truncation -Wsurprising -ffpe-trap=invalid,zero,overflow (...)
-march=haswell -mfpmath=sse -c
(-march=native gives the same result as with -march=haswell).
This feels strange to me, as I would expect that having additional instructions available should make the code faster, not slower.
First: this is a new machine and a replacement of my old one at work, so unfortunately:
I can't test with the previous processor anymore
It is difficult for me to test with a gfortran version other than the one installed
Now, I did some profiling with gprof and different -march= settings (see gcc online listing). On this test:
core2, nehalem, westmere all lead to ~85s
starting from sandybridge (adding the AVX instruction set), execution time jumps to 122s (128s for haswell).
Here are the reported profiles, cut at functions > 1.0s self time.
Flat profile for -march=core2:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
8.92 6.18 6.18 __sinl_internal
8.50 12.07 5.89 __cosl_internal
7.26 17.10 5.03 _mcount_private
6.42 21.55 4.45 exp
6.41 25.99 4.44 exp2l
5.08 29.51 3.52 __fentry__
3.71 32.08 2.57 35922427 0.07 0.18 predn_
3.53 34.53 2.45 log2l
3.36 36.86 2.33 79418108 0.03 0.03 vxs_tvxs_
2.90 38.87 2.01 97875942 0.02 0.02 rk4m_
2.83 40.83 1.96 403671 4.86 77.44 radarx_
2.16 42.33 1.50 4063165 0.37 0.43 dchdd_
2.14 43.81 1.48 pow
2.11 45.27 1.46 8475809 0.17 0.27 aerosj_
2.09 46.72 1.45 23079874 0.06 0.06 snrm2_
1.86 48.01 1.29 cos
1.80 49.26 1.25 sin
1.75 50.47 1.21 15980084 0.08 0.08 sgemv_
1.66 51.62 1.15 61799016 0.02 0.05 x2acc_
1.64 52.76 1.14 43182542 0.03 0.03 atmostd_
1.56 53.84 1.08 24821235 0.04 0.04 axb_
1.53 54.90 1.06 138497449 0.01 0.01 axvc_
Flat profile for -march=haswell:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
6.49 6.71 6.71 __sinl_internal
6.05 12.96 6.25 __cosl_internal
5.55 18.70 5.74 _mcount_private
5.16 24.03 5.33 exp
5.14 29.34 5.31 cos
4.87 34.37 5.03 sin
4.67 39.20 4.83 exp2l
4.55 43.90 4.70 35922756 0.13 0.34 predn_
4.38 48.43 4.53 8475884 0.53 0.69 aerosj_
3.72 52.27 3.84 pow
3.43 55.82 3.55 __fentry__
2.79 58.70 2.88 403672 7.13 120.62 radarx_
2.64 61.43 2.73 79396558 0.03 0.03 vxs_tvxs_
2.36 63.87 2.44 log2l
1.95 65.89 2.02 97881202 0.02 0.02 rk4m_
1.80 67.75 1.86 12314052 0.15 0.15 axs_txs_
1.74 69.55 1.80 8475848 0.21 0.66 mvpd_
1.72 71.33 1.78 36345392 0.05 0.05 gauss_
1.53 72.91 1.58 25028687 0.06 0.06 aescudi_
1.52 74.48 1.57 43187368 0.04 0.04 atmostd_
1.44 75.97 1.49 23077428 0.06 0.06 snrm2_
1.43 77.45 1.48 17560212 0.08 0.08 txs_axs_
1.38 78.88 1.43 4062635 0.35 0.42 dchdd_
1.36 80.29 1.41 internal_modf
1.30 81.63 1.34 61800367 0.02 0.06 x2acc_
1.26 82.93 1.30 log
1.25 84.22 1.29 138497176 0.01 0.01 axvc_
1.24 85.50 1.28 15978523 0.08 0.08 sgemv_
1.10 86.64 1.14 10707022 0.11 0.11 ec_txs_
1.09 87.77 1.13 8475648 0.13 0.21 g_eval_
1.06 88.87 1.10 __logl_internal
0.98 89.88 1.01 17765874 0.06 0.07 solgeo_
0.98 90.89 1.01 15978523 0.06 0.06 sger_
You'll notice that basically everything seems slower with -march=haswell (even internal functions like sin/cos/exp!).
As an example of the code, here is the function vxs_tvxs, which consumes 2.73 s vs 2.33 s:
SUBROUTINE VXS_TVXS(VXS,TVXS)
  REAL VXS(3),TVXS(3,3)
  VTOT=sqrt(sum(VXS**2))
  VH=sqrt(VXS(1)**2+VXS(2)**2)
  if (VTOT==0.) then
    print*,'PB VXS_TVXS : VTOT=',VTOT
    stop
  endif
  sg=-VXS(3)/VTOT
  cg=VH/VTOT
  if (VH==0.) then
    sc=0.
    cc=1.
  else
    sc=VXS(2)/VH
    cc=VXS(1)/VH
  endif
  TVXS(1,:)=(/ cg*cc, cg*sc, -sg/)
  TVXS(2,:)=(/ -sc,   cc,    0./)
  TVXS(3,:)=(/ sg*cc, sg*sc, cg/)
  RETURN
END
Seems quite an innocuous function to me...
I have made a very simple test program:
PROGRAM PIPO
  REAL VXS0(3),VXS(3),TVXS(3,3)
  VXS0=(/50.,100.,200./)
  VXS=VXS0
  call cpu_time(start)
  do k=1,50000000
    call VXS_TVXS(VXS,TVXS)
    VXS=0.5*(VXS0+TVXS(1+mod(k,3),:))
    VXS=cos(VXS)
  enddo
  call cpu_time(finish)
  print*,finish-start,VXS
END
Unfortunately, in this test case, all -march settings end up with about the same time requirement.
So I really don't get what is happening... plus, as we see from the previous profile, the fact that even internal functions are costing more feels very puzzling.
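One way to narrow this down would be to keep -march=haswell but switch off individual instruction-set extensions, to see whether AVX alone accounts for the difference. This is a sketch rather than something from the original build (the file name is a placeholder, and only a subset of the original flags is shown):
# haswell, but with AVX disabled
gfortran -O3 -fdefault-real-8 -march=haswell -mno-avx -mfpmath=sse -c mycode.f90
# haswell with AVX kept, but restricted to 128-bit vectors
gfortran -O3 -fdefault-real-8 -march=haswell -mprefer-avx128 -mfpmath=sse -c mycode.f90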
This is a .NET v4 Windows service application running on an x64 machine. At some point, after days of running steadily, the service's memory consumption spikes up like crazy until it crashes. I was able to catch it at 1.2 GB and capture a memory dump. Here is what I get.
If I run !address -summary in WinDbg on my dump file, I get the following result:
!address -summary
--- Usage Summary ------ RgnCount ------- Total Size -------- %ofBusy %ofTotal
Free 821 7ff`7e834000 ( 7.998 Tb) 99.98%
<unclassified> 3696 0`6eece000 ( 1.733 Gb) 85.67% 0.02%
Image 1851 0`0ea6f000 ( 234.434 Mb) 11.32% 0.00%
Stack 1881 0`03968000 ( 57.406 Mb) 2.77% 0.00%
TEB 628 0`004e8000 ( 4.906 Mb) 0.24% 0.00%
NlsTables 1 0`00023000 ( 140.000 kb) 0.01% 0.00%
ActivationContextData 3 0`00006000 ( 24.000 kb) 0.00% 0.00%
CsrSharedMemory 1 0`00005000 ( 20.000 kb) 0.00% 0.00%
PEB 1 0`00001000 ( 4.000 kb) 0.00% 0.00%
-
-
-
--- Type Summary (for busy) -- RgnCount ----- Total Size ----- %ofBusy %ofTotal
MEM_PRIVATE 5837 0`7115a000 ( 1.767 Gb) 87.34% 0.02%
MEM_IMAGE 2185 0`0f131000 (241.191 Mb) 11.64% 0.00%
MEM_MAPPED 40 0`01531000 ( 21.191 Mb) 1.02% 0.00%
-
-
--- State Summary ------------ RgnCount ------ Total Size ---- %ofBusy %ofTotal
MEM_FREE 821 7ff`7e834000 ( 7.998 Tb) 99.98%
MEM_COMMIT 6127 0`4fd5e000 ( 1.247 Gb) 61.66% 0.02%
MEM_RESERVE 1935 0`31a5e000 (794.367 Mb) 38.34% 0.01%
-
-
--Protect Summary(for commit)- RgnCount ------ Total Size --- %ofBusy %ofTotal
PAGE_READWRITE 3412 0`3e862000 (1000.383 Mb) 48.29% 0.01%
PAGE_EXECUTE_READ 220 0`0b12f000 ( 177.184 Mb) 8.55% 0.00%
PAGE_READONLY 646 0`02fd0000 ( 47.813 Mb) 2.31% 0.00%
PAGE_WRITECOPY 410 0`01781000 ( 23.504 Mb) 1.13% 0.00%
PAGE_READWRITE|PAGE_GUARD 1224 0`012f2000 ( 18.945 Mb) 0.91% 0.00%
PAGE_EXECUTE_READWRITE 144 0`007b9000 ( 7.723 Mb) 0.37% 0.00%
PAGE_EXECUTE_WRITECOPY 70 0`001cd000 ( 1.801 Mb) 0.09% 0.00%
PAGE_EXECUTE 1 0`00004000 ( 16.000 kb) 0.00% 0.00%
-
-
--- Largest Region by Usage ----Base Address -------- Region Size ----------
Free 0`8fff0000 7fe`59050000 ( 7.994 Tb)
<unclassified> 0`80d92000 0`0f25e000 ( 242.367 Mb)
Image fe`f6255000 0`0125a000 ( 18.352 Mb)
Stack 0`014d0000 0`000fc000 (1008.000 kb)
TEB 0`7ffde000 0`00002000 ( 8.000 kb)
NlsTables 7ff`fffb0000 0`00023000 ( 140.000 kb)
ActivationContextData 0`00030000 0`00004000 ( 16.000 kb)
CsrSharedMemory 0`7efe0000 0`00005000 ( 20.000 kb)
PEB 7ff`fffdd000 0`00001000 ( 4.000 kb)
First, why would unclassified show up once as 1.73 GB and the other time as 242 MB? (This has been answered - thank you.)
Second, I understand that unclassified can mean managed code; however, my heap size according to !eeheap is only 248 MB, which roughly matches the 242 MB but is nowhere close to the 1.73 GB. The dump file size is 1.2 GB, which is much higher than normal. Where do I go from here to find out what's using all the memory? Anything in the managed heap world is under 248 MB, but I'm using 1.2 GB.
Thanks
EDIT
If I do !heap -s I get the following:
LFH Key : 0x000000171fab7f20
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-------------------------------------------------------------------------------------
Virtual block: 00000000017e0000 - 00000000017e0000 (size 0000000000000000)
Virtual block: 0000000045bd0000 - 0000000045bd0000 (size 0000000000000000)
Virtual block: 000000006fff0000 - 000000006fff0000 (size 0000000000000000)
0000000000060000 00000002 113024 102028 113024 27343 1542 11 3 1c LFH
External fragmentation 26 % (1542 free blocks)
0000000000010000 00008000 64 4 64 1 1 1 0 0
0000000000480000 00001002 3136 1380 3136 20 8 3 0 0 LFH
0000000000640000 00041002 512 8 512 3 1 1 0 0
0000000000800000 00001002 3136 1412 3136 15 7 3 0 0 LFH
00000000009d0000 00001002 3136 1380 3136 19 7 3 0 0 LFH
00000000008a0000 00041002 512 16 512 3 1 1 0 0
0000000000630000 00001002 7232 3628 7232 18 53 4 0 0 LFH
0000000000da0000 00041002 1536 856 1536 1 1 2 0 0 LFH
0000000000ef0000 00041002 1536 944 1536 4 12 2 0 0 LFH
00000000034b0000 00001002 1536 1452 1536 6 17 2 0 0 LFH
00000000019c0000 00001002 3136 1396 3136 16 6 3 0 0 LFH
0000000003be0000 00001002 1536 1072 1536 5 7 2 0 3 LFH
0000000003dc0000 00011002 512 220 512 100 60 1 0 2
0000000002520000 00001002 512 8 512 3 2 1 0 0
0000000003b60000 00001002 339712 168996 339712 151494 976 116 0 18 LFH
External fragmentation 89 % (976 free blocks)
Virtual address fragmentation 50 % (116 uncommited ranges)
0000000003f20000 00001002 64 8 64 3 1 1 0 0
0000000003d90000 00001002 64 8 64 3 1 1 0 0
0000000003ee0000 00001002 64 16 64 11 1 1 0 0
-------------------------------------------------------------------------------------
I've recently had a very similar situation and found a couple techniques useful in the investigation. None is a silver bullet, but each sheds a little more light on the problem.
1) vmmap.exe from SysInternals (http://technet.microsoft.com/en-us/sysinternals/dd535533) does a good job of correlating information on native and managed memory and presenting it in a nice UI. The same information can be gathered using the techniques below, but this is way easier and a nice place to start. Sadly, it doesn't work on dump files; you need a live process.
2) The "!address -summary" output is a rollup of the more detailed "!address" output. I found it useful to drop the detailed output into Excel and run some pivots. Using this technique I discovered that a large number of bytes that were listed as "<unclassified>" were actually MEM_IMAGE pages, likely copies of data pages that were loaded when the DLLs were loaded but then copied when the data was changed. I could also filter to large regions and drill in on specific addresses. Poking around in the memory dump with a toothpick and lots of praying is painful, but can be revealing.
3) Finally, I did a poor man's version of the vmmap.exe technique above. I loaded up the dump file, opened a log, and ran !address, !eeheap, !heap, and !threads. I also targeted the thread environment blocks listed in ~*k with !teb. I closed the log file and loaded it up in my favorite editor. I could then find an unclassified block and search to see if it popped up in the output from one of the more detailed commands. You can pretty quickly correlate native and managed heaps to weed those out of your suspect unclassified regions.
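As a sketch, the command sequence for that "poor man's" pass looks roughly like this in WinDbg (the log path is a placeholder, and !eeheap/!threads assume the SOS extension is already loaded):
.logopen c:\temp\dump-survey.log
$$ every region with its usage and protection
!address
$$ managed (GC) heap segments
!eeheap
$$ native heap summary
!heap -s
$$ managed thread list
!threads
$$ native stacks for all threads (this is also where the TEB list comes from)
~*k
.logclose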
These are all way too manual. I'd love to write a script that would take output similar to what I generated in technique 3 above and produce an .mmp file suitable for viewing in vmmap.exe. Some day.
One last note: I correlated vmmap.exe's output with the !address output and noted the types of regions that vmmap could identify from various sources (similar to what !heap and !eeheap use) but that !address didn't know about. That is, these are things that vmmap.exe labeled but !address didn't:
.data
.pdata
.rdata
.text
64-bit thread stack
Domain 1
Domain 1 High Frequency Heap
Domain 1 JIT Code Heap
Domain 1 Low Frequency Heap
Domain 1 Virtual Call Stub
Domain 1 Virtual Call Stub Lookup Heap
Domain 1 Virtual Call Stub Resolve Heap
GC
Large Object Heap
Native heaps
Thread Environment Blocks
There were still a lot of "private" bytes unaccounted for, but again, I'm able to narrow the problem if I can weed these out.
Hope this gives you some ideas on how to investigate. I'm in the same boat so I'd appreciate what you find, too. Thanks!
The "Usage Summary" tells you that you have 3696 unclassified regions totalling 1.733 GB.
The "Largest Region" tells you that the largest of the unclassified regions is 242 MB.
The rest of the unclassified regions (3695 of them) together make up the difference, up to 1.733 GB.
Try doing a !heap -s and summing up the Virt column to see the size of the native heaps; I think these also fall into the unmanaged bucket.
(NB: earlier versions show the native heap explicitly in !address -summary.)
I keep a copy of Debugging Tools for Windows 6.11.1.404, which seems to be able to display something more meaningful for "unclassified".
With that version, I see a list of TEB addresses and then this:
0:000> !address -summary
--------- PEB fffde000 not found ----
TEB fffdd000 in range fffdb000 fffde000
TEB fffda000 in range fffd8000 fffdb000
...snip...
TEB fe01c000 in range fe01a000 fe01d000
ProcessParametrs 002c15e0 in range 002c0000 003c0000
Environment 002c0810 in range 002c0000 003c0000
-------------------- Usage SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Pct(Busy) Usage
41f08000 ( 1080352) : 25.76% 34.88% : RegionUsageIsVAD
42ecf000 ( 1096508) : 26.14% 00.00% : RegionUsageFree
5c21000 ( 94340) : 02.25% 03.05% : RegionUsageImage
c900000 ( 205824) : 04.91% 06.64% : RegionUsageStack
0 ( 0) : 00.00% 00.00% : RegionUsageTeb
68cf8000 ( 1717216) : 40.94% 55.43% : RegionUsageHeap
0 ( 0) : 00.00% 00.00% : RegionUsagePageHeap
0 ( 0) : 00.00% 00.00% : RegionUsagePeb
0 ( 0) : 00.00% 00.00% : RegionUsageProcessParametrs
0 ( 0) : 00.00% 00.00% : RegionUsageEnvironmentBlock
Tot: ffff0000 (4194240 KB) Busy: bd121000 (3097732 KB)
-------------------- Type SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Usage
42ecf000 ( 1096508) : 26.14% : <free>
5e6e000 ( 96696) : 02.31% : MEM_IMAGE
28ed000 ( 41908) : 01.00% : MEM_MAPPED
b49c6000 ( 2959128) : 70.55% : MEM_PRIVATE
-------------------- State SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Usage
9b4d1000 ( 2544452) : 60.67% : MEM_COMMIT
42ecf000 ( 1096508) : 26.14% : MEM_FREE
21c50000 ( 553280) : 13.19% : MEM_RESERVE
Largest free region: Base bc480000 - Size 38e10000 (931904 KB)
With my "current" version (6.12.2.633) I get this from the same dump. Two things I note:
The <unclassified> data seems to be the sum of the HeapAlloc/RegionUsageHeap and VirtualAlloc/RegionUsageIsVAD numbers.
The lovely E_FAIL error (80004005) is no doubt in part responsible for the missing data!
I'm not sure how that'll help you with your managed code, but I think it actually answers the original question ;-)
0:000> !address -summary
Failed to map Heaps (error 80004005)
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
<unclassified> 7171 aab21000 ( 2.667 Gb) 90.28% 66.68%
Free 637 42ecf000 ( 1.046 Gb) 26.14%
Stack 603 c900000 ( 201.000 Mb) 6.64% 4.91%
Image 636 5c21000 ( 92.129 Mb) 3.05% 2.25%
TEB 201 c9000 ( 804.000 kb) 0.03% 0.02%
ActivationContextData 14 11000 ( 68.000 kb) 0.00% 0.00%
CsrSharedMemory 1 5000 ( 20.000 kb) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE 7921 b49c6000 ( 2.822 Gb) 95.53% 70.55%
MEM_IMAGE 665 5e6e000 ( 94.430 Mb) 3.12% 2.31%
MEM_MAPPED 40 28ed000 ( 40.926 Mb) 1.35% 1.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_COMMIT 5734 9b4d1000 ( 2.427 Gb) 82.14% 60.67%
MEM_FREE 637 42ecf000 ( 1.046 Gb) 26.14%
MEM_RESERVE 2892 21c50000 ( 540.313 Mb) 17.86% 13.19%
--- Protect Summary (for commit) - RgnCount ----------- Total Size -------- %ofBusy %ofTotal
PAGE_READWRITE 4805 942bd000 ( 2.315 Gb) 78.37% 57.88%
PAGE_READONLY 215 3cbb000 ( 60.730 Mb) 2.01% 1.48%
PAGE_EXECUTE_READ 78 2477000 ( 36.465 Mb) 1.21% 0.89%
PAGE_WRITECOPY 74 75b000 ( 7.355 Mb) 0.24% 0.18%
PAGE_READWRITE|PAGE_GUARD 402 3d6000 ( 3.836 Mb) 0.13% 0.09%
PAGE_EXECUTE_READWRITE 80 3b0000 ( 3.688 Mb) 0.12% 0.09%
PAGE_EXECUTE_WRITECOPY 80 201000 ( 2.004 Mb) 0.07% 0.05%
--- Largest Region by Usage ----------- Base Address -------- Region Size ----------
<unclassified> 786000 17d9000 ( 23.848 Mb)
Free bc480000 38e10000 ( 910.063 Mb)
Stack 6f90000 fd000 (1012.000 kb)
Image 3c3c000 ebe000 ( 14.742 Mb)
TEB fdf8f000 1000 ( 4.000 kb)
ActivationContextData 190000 4000 ( 16.000 kb)
CsrSharedMemory 7efe0000 5000 ( 20.000 kb)
Your best bet would be to use the !eeheap and !gchandles commands in WinDbg (http://msdn.microsoft.com/en-us/library/bb190764.aspx) and see if you can find what might be leaking/wrong that way.
Unfortunately, you probably won't get the exact help you're looking for, because diagnosing these types of issues is almost always very time-intensive and, outside of the simplest cases, requires someone to do a full analysis of the dump. Basically, it's unlikely that someone will be able to point you to a direct answer on Stack Overflow; mostly people will be able to point you to commands that might be helpful. You're going to have to do a lot of digging to find out more about what is happening.
I recently spent some time diagnosing a customer's issue where their app was using 70 GB before terminating (likely due to hitting an IIS App Pool recycling limit, but still unconfirmed). They sent me a 35 GB memory dump. Based on my recent experience, here are some observations I can make about what you've provided:
In the !heap -s output, 284 MB of the 1.247 GB shows up in the Commit column. If you were to open this dump in DebugDiag, it would tell you that heap 0x60000 has 1 GB of committed memory. You'd add up the commit sizes of the 11 segments reported and find that they only account for about 102 MB, not 1 GB. So annoying.
The "missing" memory isn't missing. It's actually hinted at in the !heap -s output as "Virtual block:" lines. Unfortunately, !heap -s sucks and doesn't show the end address properly and therefore reports size as 0. Check the output of the following commands:
!address 17e0000
!address 45bd0000
!address 6fff0000
It will report the proper end address and therefore an accurate "Region Size". Even better, it gives a succinct version of the region size. If you add the size of those 3 regions to 102 MB, you should be pretty close to 1 GB.
So what's in them? Well, you can look using dq. By spelunking you might find a hint at why they were allocated. Perhaps your managed code calls some 3rd party code which has a native side.
You might be able to find references to your heap by using !heap 6fff0000 -x -v. If there are references you can see what memory regions they live in by using !address again. In my customer issue I found a reference that lived on a region with "Usage: Stack". A "More info: " hint referenced the stack's thread which happened to have some large basic_string append/copy calls at the top.
I am trying to get ddply to run in parallel on my mac. The code I've used is as follows:
library(doMC)
library(ggplot2) # for the purposes of getting the baseball data.frame
registerDoMC(2)
> system.time(ddply(baseball, .(year), numcolwise(mean)))
user system elapsed
0.959 0.106 1.522
> system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
user system elapsed
2.221 2.790 2.552
Why is ddply slower when I run .parallel=TRUE? I have searched online to no avail. I've also tried registerDoMC() and the results were the same.
The baseball data may be too small to see an improvement from making the computations parallel; the overhead of passing the data to the different processes may be swamping any speedup from doing the calculations in parallel. Using the rbenchmark package:
baseball10 <- baseball[rep(seq(length=nrow(baseball)), 10),]
benchmark(noparallel = ddply(baseball, .(year), numcolwise(mean)),
parallel = ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE),
noparallel10 = ddply(baseball10, .(year), numcolwise(mean)),
parallel10 = ddply(baseball10, .(year), numcolwise(mean), .parallel=TRUE),
replications = 10)
gives results
test replications elapsed relative user.self sys.self user.child sys.child
1 noparallel 10 4.562 1.000000 4.145 0.408 0.000 0.000
3 noparallel10 10 14.134 3.098203 9.815 4.242 0.000 0.000
2 parallel 10 11.927 2.614423 2.394 1.107 4.836 6.891
4 parallel10 10 18.406 4.034634 4.045 2.580 10.210 9.769
With a 10 times bigger data set, the penalty for parallel is smaller. A more complicated computation would also tilt it even further in parallel's favor, likely giving it an advantage.
This was run on a Mac OS X 10.5.8 Core 2 Duo machine.
Running in parallel will be slower than running sequentially when the communication costs between the nodes are greater than the calculation time of the function. In other words, it takes longer to send the data to/from the nodes than it does to perform the calculation.
For the same data set, the communication costs are approximately fixed, so parallel processing is going to be more useful as the time spent evaluating the function increases.
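For instance, a deliberately expensive per-group function makes the computation, rather than the communication, dominate. Here is a sketch along those lines; the heavy() name and the bootstrap count of 2000 are arbitrary choices, not something from the original benchmark:
library(plyr)
library(doMC)
registerDoMC(2)
# An intentionally slow per-group summary: bootstrap the standard error of mean at-bats.
heavy <- function(df) {
  boot_means <- replicate(2000, mean(sample(df$ab, replace = TRUE), na.rm = TRUE))
  data.frame(mean_ab = mean(df$ab, na.rm = TRUE), se_ab = sd(boot_means))
}
system.time(ddply(baseball, .(year), heavy))
system.time(ddply(baseball, .(year), heavy, .parallel = TRUE))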
UPDATE:
The code below shows that 0.14 seconds (on my machine) are spent evaluating .fun. That means the communication overhead has to be less than about 0.07 seconds (the most that splitting the work across two cores could save) for parallel to win, and that's not realistic for a data set the size of baseball.
Rprof()
system.time(ddply(baseball, .(year), numcolwise(mean)))
# user system elapsed
# 0.28 0.02 0.30
Rprof(NULL)
summaryRprof()$by.self
# self.time self.pct total.time total.pct
# [.data.frame 0.04 12.50 0.10 31.25
# unlist 0.04 12.50 0.10 31.25
# match 0.04 12.50 0.04 12.50
# .fun 0.02 6.25 0.14 43.75
# structure 0.02 6.25 0.12 37.50
# [[ 0.02 6.25 0.08 25.00
# FUN 0.02 6.25 0.06 18.75
# rbind.fill 0.02 6.25 0.06 18.75
# anyDuplicated 0.02 6.25 0.02 6.25
# gc 0.02 6.25 0.02 6.25
# is.array 0.02 6.25 0.02 6.25
# list 0.02 6.25 0.02 6.25
# mean.default 0.02 6.25 0.02 6.25
Here's the parallel version with snow:
library(doSNOW)
cl <- makeSOCKcluster(2)
registerDoSNOW(cl)
Rprof()
system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
# user system elapsed
# 0.46 0.01 0.73
Rprof(NULL)
summaryRprof()$by.self
# self.time self.pct total.time total.pct
# .Call 0.24 33.33 0.24 33.33
# socketSelect 0.16 22.22 0.16 22.22
# lazyLoadDBfetch 0.08 11.11 0.08 11.11
# accumulate.iforeach 0.04 5.56 0.06 8.33
# rbind.fill 0.04 5.56 0.06 8.33
# structure 0.04 5.56 0.04 5.56
# <Anonymous> 0.02 2.78 0.54 75.00
# lapply 0.02 2.78 0.04 5.56
# constantFoldEnv 0.02 2.78 0.02 2.78
# gc 0.02 2.78 0.02 2.78
# stopifnot 0.02 2.78 0.02 2.78
# summary.connection 0.02 2.78 0.02 2.78