Golang Alloc and HeapAlloc vs pprof large discrepancies

I have a Go program that calculates large correlation matrices in memory. To do this I've set up a pipeline of 3 goroutines where the first reads in files, the second calculates the correlation matrix and the last stores the result to disk.
The problem is that when I run the program, the Go runtime allocates ~17 GB of memory, while a single matrix only takes up ~2-3 GB. runtime.ReadMemStats shows the program using ~17 GB (verified with htop), but pprof only reports about 2.3 GB.
If I look at the mem stats after running one file through the pipeline:
var mem runtime.MemStats
runtime.ReadMemStats(&mem)
// mem.Alloc is the number of bytes of live (not yet freed) heap objects.
fmt.Printf("Total alloc: %d GB\n", mem.Alloc/1000/1000/1000)
This shows the total allocation of the program:
Total alloc: 17 GB
However, if I run go tool pprof mem.prof I get the following results:
(pprof) top5
Showing nodes accounting for 2.21GB, 100% of 2.21GB total
Showing top 5 nodes out of 9
flat flat% sum% cum cum%
1.20GB 54.07% 54.07% 1.20GB 54.07% dataset.(*Dataset).CalcCorrelationMatrix
1.02GB 45.93% 100% 1.02GB 45.93% bytes.makeSlice
0 0% 100% 1.02GB 45.93% bytes.(*Buffer).WriteByte
0 0% 100% 1.02GB 45.93% bytes.(*Buffer).grow
0 0% 100% 1.02GB 45.93% encoding/json.Indent
So I am wondering how I can go about finding out why the program allocates 17 GB, when the peak memory usage seems to be only ~2.5 GB?
Is there a way to trace the memory usage throughout the program using pprof?
EDIT
I ran the program again with GODEBUG=gctrace=1 and got the following trace:
gc 1 @0.017s 0%: 0.005+0.55+0.003 ms clock, 0.022+0/0.47/0.11+0.012 ms cpu, 1227->1227->1226 MB, 1228 MB goal, 4 P
gc 2 @14.849s 0%: 0.003+1.7+0.004 ms clock, 0.015+0/1.6/0.11+0.018 ms cpu, 1227->1227->1227 MB, 2452 MB goal, 4 P
gc 3 @16.850s 0%: 0.006+60+0.003 ms clock, 0.027+0/0.46/59+0.015 ms cpu, 1876->1876->1712 MB, 2455 MB goal, 4 P
gc 4 @22.861s 0%: 0.005+238+0.003 ms clock, 0.021+0/0.46/237+0.015 ms cpu, 3657->3657->3171 MB, 3658 MB goal, 4 P
gc 5 @30.716s 0%: 0.005+476+0.004 ms clock, 0.022+0/0.44/476+0.017 ms cpu, 5764->5764->5116 MB, 6342 MB goal, 4 P
gc 6 @46.023s 0%: 0.005+949+0.004 ms clock, 0.020+0/0.47/949+0.017 ms cpu, 10302->10302->9005 MB, 10303 MB goal, 4 P
gc 7 @64.878s 0%: 0.006+382+0.004 ms clock, 0.024+0/0.46/382+0.019 ms cpu, 16548->16548->7728 MB, 18011 MB goal, 4 P
gc 8 @89.774s 0%: 0.86+2805+0.006 ms clock, 3.4+0/24/2784+0.025 ms cpu, 20208->20208->17088 MB, 20209 MB goal, 4 P
So it is quite obvious that the heap grows steadily through the program, but I am not able to pinpoint where. I've profiled memory usage using pprof.WriteHeapProfile after calling the memory intensive functions:
func memoryProfile(profpath string) {
    if _, err := os.Stat(profpath); os.IsNotExist(err) {
        os.Mkdir(profpath, os.ModePerm)
    }
    proffile := path.Join(profpath, "mem.mprof")
    f, err := os.Create(proffile)
    if err != nil {
        panic(err)
    }
    fmt.Printf("Creating memory profile in %s\n", proffile)
    // The heap profile reflects allocations as of the most recently completed GC.
    if err := pprof.WriteHeapProfile(f); err != nil {
        panic(err)
    }
    f.Close()
}

As mentioned in the comments by JimB, the Go heap profiler is a sampling profiler: by default it records roughly one sample per 512 KiB allocated. In my case the sampling was not frequent enough to catch a function (JSON marshalling) that was allocating large amounts of memory.
Increasing the sample rate of the profiler by setting the environment variable
$ export GODEBUG=memprofilerate=1
will update runtime.MemProfileRate, and the profile then includes every allocated block.
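Equivalently, the rate can be set from code. A minimal sketch (not the original program), setting runtime.MemProfileRate at the very start of main, before any allocations of interest:

package main

import (
    "os"
    "runtime"
    "runtime/pprof"
)

func main() {
    // Record every allocation in the heap profile instead of sampling
    // (the default is roughly one sample per 512 KiB allocated).
    runtime.MemProfileRate = 1

    // ... run the memory-intensive pipeline here ...

    f, err := os.Create("mem.prof")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    runtime.GC() // flush up-to-date allocation statistics into the profile
    if err := pprof.WriteHeapProfile(f); err != nil {
        panic(err)
    }
}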

A possible cause (as it was in my case) is that the binary was compiled with -race, which enables data race detection.
The overhead for this is huge and will look like a massive memory leak when checked with htop or something similar, but it won't show up in any pprof output.

Related

Latency of accessing main memory is almost the same order as sending a packet

Looking at Jeff Dean's famous latency guides
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
One thing that seems counterintuitive to me is that reading 1 MB sequentially from disk is only about 10 times faster than sending a round-trip packet across the Atlantic. Can anyone give me more intuition about why this is the case?
Q : 1MB SEQ-HDD-READ ~ 10x faster than a CA/NL trans-atlantic RTT - why this feels right?
Some "old" values ( with a few cross-QPI/NUMA updates from 2017 ) to start from:
0.5 ns - CPU L1 dCACHE reference
1 ns - speed-of-light (a photon) travels a 1 ft (30.5 cm) distance
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - CPU own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1 KB with Zippy PROCESS (+GHz,+SIMD,+multicore tricks)
20,000 ns - Send 2 KB over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
Trans-Atlantic Network RTT :
Global optical networks carry signals at roughly the speed of light (300,000,000 m/s in vacuum; about two-thirds of that inside optical fibre).
An LA(CA)-AMS(NL) packet does not travel the geodesic "distance", but over a set of continental and trans-Atlantic "submarine" cables, whose total length is considerably longer (see the map).
These factors do not "improve": only the transport capacity grows, while the add-on latencies introduced by optical amplifiers, retiming units and other L1-PHY / L2-/L3-networking equipment are kept under control, as small as possible.
So the LA(CA)-AMS(NL) RTT will remain, using this technology, the same ~ 150 ms
Using another technology, LEO-Sat constellations for example, the "distance" only grows: to the ~9,000 km point-to-point path you add a pair of GND/LEO segments plus a few LEO/LEO hops, which introduce a "longer" distance and add-on per-hop re-processing latencies, while the capacity gets nowhere near the current optical transports. So no magic jump "back to the future" is to be expected (we still miss the DeLorean).
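As a rough sanity check (a sketch with assumed figures: propagation in fibre at about two-thirds of c, an LA-Amsterdam great-circle distance of ~8,800 km, and an actual cable route of ~12,000 km), the propagation-only RTT already lands in the 100+ ms ballpark, before any switching, amplification or routing overhead:

package main

import "fmt"

func main() {
    const (
        cVacuumKmPerS = 299792.458            // speed of light in vacuum
        fibreKmPerS   = cVacuumKmPerS * 2 / 3 // rough propagation speed in optical fibre
        geodesicKm    = 8800.0                // assumed LA-Amsterdam great-circle distance
        cableKm       = 12000.0               // assumed one-way cable route length
    )

    rttGeodesicMs := 2 * geodesicKm / fibreKmPerS * 1e3
    rttCableMs := 2 * cableKm / fibreKmPerS * 1e3

    fmt.Printf("ideal geodesic RTT:            ~%.0f ms\n", rttGeodesicMs) // ~88 ms
    fmt.Printf("cable-route RTT (propagation): ~%.0f ms\n", rttCableMs)    // ~120 ms
}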
The HDD Disk :
HDDs can have a very fast and very short transport path for moving the data, but the READ ops have to wait for the physical / mechanical movement of the read heads (that is where most of the time goes, not the actual data transfer to the host RAM).
HDDs are rotational devices; the disk has to "align" to where the read starts, which costs about the first 10 [ms].
HDDs store data in a static structure of heads (2+, reading physical signals from the magnetic platter surfaces) : cylinders (concentric circular zones on the platter, into which a cylinder-aligned read head is settled by the drive's micro-controller) : sectors (angular sections of a cylinder, each carrying a block of data of the same size, ~4 KB, 8 KB, ...).
These factors do not "improve": all commodity drives stay at industry-selected angular speeds of about { 5k4 | 7k2 | 10k | 15k | 18k } spins/min (RPM). This means that if a well-compacted data layout is maintained on such a disk, one continuous head:cylinder-aligned read around the whole cylinder will take:
>>> [ 1E3 / ( RPM / 60. ) for RPM in ( 5400, 7200, 10000, 15000, 18000 ) ]
11.1 ms per CYL # 5k4 RPM disk,
8.3 ms per CYL # 7k2 RPM disk,
6.0 ms per CYL # 10k RPM disk,
4.0 ms per CYL # 15k RPM disk,
3.3 ms per CYL # 18k RPM disk.
Data-density is also limited by the magnetic media properties. Spintronics R&D will bring some more densely stored data, yet the last 30 years have been well inside the limits of the reliable magnetic storage.
More is to be expected from a trick of co-parallel reads from several heads at once, yet this goes against the design of the embedded micro-controllers, so most reading happens sequentially, one head after another, into the HDD controller's on-board buffers, best if no cyl-to-cyl mechanical head re-alignment has to take place (technically this depends on the prior data-to-disk layout, maintained by the O/S and possibly by disk optimisers, originally known as disk "defragmenters", which just tried to re-align the known sequences of FAT-described data blocks so as to follow the most optimal trajectory of head:cyl:sector transitions, depending mostly on the actual device's head:head and cyl:cyl latencies). So even the most optimistic data layout takes ~13..21 [ms] to seek-and-read just one head:cyl path.
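To make the ~13..21 [ms] figure concrete, here is a small sketch that adds up the components of a single 1 MB read using assumed figures (10 ms average seek, a 7200 RPM spindle, ~150 MB/s sustained sequential transfer):

package main

import "fmt"

func main() {
    const (
        seekMs       = 10.0 // assumed average seek time
        rpm          = 7200.0
        transferMBps = 150.0 // assumed sustained sequential transfer rate
        readMB       = 1.0
    )

    rotationMs := 0.5 * 1e3 / (rpm / 60.0) // average rotational latency: half a revolution
    transferMs := readMB / transferMBps * 1e3

    total := seekMs + rotationMs + transferMs
    fmt.Printf("seek %.1f ms + rotation %.1f ms + transfer %.1f ms = %.1f ms\n",
        seekMs, rotationMs, transferMs, total) // roughly the ~20 ms figure in the tables above
}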
Laws of Physics decide
Some numbers from 2020.
Load from L1 is 4 cycles on Intel Coffee Lake and Ryzen (0.8nsec on a 5GHz CPU).
Load from memory is ~215 cycles on Intel Coffee Lake (43nsec on a 5GHz CPU). ~280 cycles on Ryzen.

Spark Scratch Space

I have a cluster of 13 machines with 4 physical CPUs and 24 G of RAM.
I started a spark cluster with one driver and 12 slaves.
I set the number of cores per slave to 12, meaning I have a cluster as follows:
Alive Workers: 12
Cores in use: 144 Total, 110 Used
Memory in use: 263.9 GB Total, 187.0 GB Used
I started an application with the following configuration:
[('spark.driver.cores', '4'),
('spark.executor.memory', '15G'),
('spark.executor.id', 'driver'),
('spark.driver.memory', '5G'),
('spark.python.worker.memory', '1042M'),
('spark.cores.max', '96'),
('spark.rdd.compress', 'True'),
('spark.serializer.objectStreamReset', '100'),
('spark.executor.cores', '8'),
('spark.default.parallelism', '48')]
I understand there are 15 G of RAM per executor, with 8 task slots and a parallelism of 48 (48 = 6 task slots * 12 slaves).
Then I have two big files on HDFS: 6 G each (from a directory of 12 files of 5 blocks of 128 MB each), with a 3x replication factor.
I union these two files => I get one dataframe of 12 GB, I think, but I see a 37 G input read through the UI:
That could be the first question: why 37 GB?
Then, as the execution time is too long for me, I try to cache the data so that I can go faster. But the caching never finishes; here you can see it has already been running for 45 minutes and is still not done (vs 6 min when not cached!):
So I try to understand why, and I look at the Memory/Disk usage in the storage section of the UI:
So there are some part of the RDD that are staying on disk.
Furthermore, I see that the executors may still have free memory:
And I notice on the same "storage" page that the size of the RDD has jumped :
Storage Level: Disk Serialized 1x Replicated
Cached Partitions: 72
Total Partitions: 72
Memory Size: 42.7 GB
Disk Size: 73.3 GB
=> I understand: Memory Size 42.7 GB + Disk Size 73.3 GB = 116 G!
=> So my two 6 G files have turned into 37 G and then into 116 G???
But I try to understand why there is still some memory left on my executors, so I go to the "err" dump of one of them, and I see:
18/02/08 11:04:08 INFO MemoryStore: Will not store rdd_50_46
18/02/08 11:04:09 WARN MemoryStore: Not enough space to cache rdd_50_46 in memory! (computed 1134.1 MB so far)
18/02/08 11:04:09 INFO MemoryStore: Memory use = 1641.6 KB (blocks) + 7.7 GB (scratch space shared across 6 tasks(s)) = 7.7 GB. Storage limit = 7.8 GB.
18/02/08 11:04:09 WARN BlockManager: Persisting block rdd_50_46 to disk instead.
And here I see that the executor wants to cache a 1641.6 KB block (only ~1.6 MB!) but can't, because there is a "scratch space" of 7.7 GB "shared across 6 tasks".
=> What is a "scratch space"?
=> The 6 tasks => comes from the parallelism of 48 / 12 = 6
And then I come back to the app information, and I see that the count that lasted 48 min read only 37 GB of data! (The 48 min are clearly spent caching the data too.)
When I do a count on the cached dataframe I have a 116 G input read:
And at the end of the day, the time saved by the cached count is not that impressive; here are 3 durations:
4.8 ' : count on cached df
48' : count while caching
5.8' : count on not cached df (read directly from hdfs)
So why is that?
Because the cached df is not that much cached :
Meaning more or less 40 Gb in memory and 60 Gb on disk.
I am surprised, because at 15 G / executor * 12 slaves => 180 GB of memory, and I can cache only 40 GB... But in fact I remember that the memory is split:
30% for spark
54% for storage
16% for shuffle
So I understand that I do have 54% * 15 G for storage per executor, i.e. 8.1 G, meaning that of my 180 GB I only have ~97 GB for storage. Why do I then have 97 - 40 ≈ 57 GB not used?
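Spelled out, the arithmetic behind that estimate (a sketch using the fractions quoted above; the real split is governed by Spark's spark.memory.fraction / spark.memory.storageFraction settings):

package main

import "fmt"

func main() {
    const (
        executors       = 12
        executorMemGB   = 15.0
        storageFraction = 0.54 // fraction quoted above; the actual value depends on the Spark config
    )

    perExecutorGB := executorMemGB * storageFraction
    clusterWideGB := perExecutorGB * executors
    fmt.Printf("storage per executor: %.1f GB, cluster-wide: %.1f GB\n",
        perExecutorGB, clusterWideGB) // ~8.1 GB and ~97 GB
}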
Oops... This is a long post!
Plenty of questions... Sorry...

pprof usage and interpretation

We believe our Go app has a memory leak.
In order to find out what's going on, we are trying with pprof.
We are having a hard time, though, understanding the readings.
When connecting to go tool pprof http://localhost:6060/debug/pprof/heap?debug=1, a sample output is
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 17608.45kB, 100% of 17608.45kB total
Showing top 10 nodes out of 67
flat flat% sum% cum cum%
12292.12kB 69.81% 69.81% 12292.12kB 69.81% github.com/acct/repo/vendor/github.com/.../funcA /../github.com/acct/repo/vendor/github.com/../fileA.go
1543.14kB 8.76% 78.57% 1543.14kB 8.76% github.com/acct/repo/../funcB /../github.com/acct/repo/fileB.go
1064.52kB 6.05% 84.62% 1064.52kB 6.05% github.com/acct/repo/vendor/github.com/../funcC /../github.com/acct/repo/vendor/github.com/fileC.go
858.34kB 4.87% 89.49% 858.34kB 4.87% github.com/acct/repo/vendor/golang.org/x/tools/imports.init /../github.com/acct/repo/vendor/golang.org/x/tools/imports/zstdlib.go
809.97kB 4.60% 94.09% 809.97kB 4.60% bytes.makeSlice /usr/lib/go/src/bytes/buffer.go
528.17kB 3.00% 97.09% 528.17kB 3.00% regexp.(*bitState).reset /usr/lib/go/src/regexp/backtrack.go
(Please forgive the clumsy obfuscation)
We interpret funcA as consuming nearly 70% of memory, but that is only around 12 MB.
However, top shows:
top - 18:09:44 up 2:02, 1 user, load average: 0,75, 0,56, 0,38
Tasks: 166 total, 1 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3,7 us, 1,6 sy, 0,0 ni, 94,3 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st
KiB Mem : 16318684 total, 14116728 free, 1004804 used, 1197152 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14451260 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4902 me 20 0 1,371g 0,096g 0,016g S 12,9 0,6 1:58.14 mybin
which suggests 1.371 GB of (virtual) memory used... where has it gone???
Also, the pprof docs are quite sparse. We are having difficulty even understanding how it should be used. Our binary is a daemon. For example:
If we start a reading with go tool pprof http://localhost:6060/debug/pprof/heap, is this a one-time reading at this particular moment, or an aggregate over time?
Sometimes hitting text again later in interactive mode seems to report the same values. Are we actually looking at the same values? Do we need to restart go tool pprof... to get fresh values?
Is it a reading of the complete heap, or of a specific goroutine, or of a specific point in the stack...???
Finally, is this interpretation correct (from http://localhost:6060/debug/pprof/):
/debug/pprof/
profiles:
0 block
64 goroutine
45 heap
0 mutex
13 threadcreate
The binary has 64 open goroutines and a total of 45 MB of heap memory?
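For context, a minimal sketch of how a daemon typically exposes these endpoints via net/http/pprof (assuming that is how the service above is instrumented). Each request to /debug/pprof/heap returns a fresh profile: the default inuse_space view is a snapshot of live allocations as of the last completed GC, while the alloc_space view accumulates since process start.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    select {} // stand-in for the daemon's real work
}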

Reading Go gctrace output

I have gctrace output that looks like this:
gc 6 @48.155s 15%: 0.093+12360+0.32 ms clock, 0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
I am not sure how to read the CPU times in particular. I understand that it is broken down into three phases (STW sweep termination, concurrent mark/scan, and STW mark termination), but I'm not sure what the + signs mean (i.e. 0.18+7720 and 3615+0.65). What do these + signs signify?
In your case, they look like assist and termination times;
// CPU time
0.18 ms  : **STW** Sweep termination.
7720 ms  : Mark/Scan - Assist Time (GC performed in line with allocation).
21356 ms : Mark/Scan - Background GC time.
3615 ms  : Mark/Scan - Idle GC time.
0.65 ms  : **STW** Mark termination.
I think it changes (or it may) over various Go versions and you can find more detailed info at runtime package docs.
Currently, it is:
gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
where the fields are as follows:
gc # the GC number, incremented at each GC
@#s time in seconds since program start
#% percentage of time spent in GC since program start
#+...+# wall-clock/CPU times for the phases of the GC
#->#-># MB heap size at GC start, at GC end, and live heap
# MB goal goal heap size
# P number of processors used
Example here
See also Interpreting GC trace output
gc 6 @48.155s 15%: 0.093+12360+0.32 ms clock,
0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
gc 6: the 6th GC since program start
@48.155s: 48.155 seconds since program start
15%: percentage of time spent in GC since program start
0.093+12360+0.32 ms clock: stop-the-world (STW) sweep termination + concurrent mark and scan + STW mark termination
0.18+7720/21356/3615+0.65 ms cpu: STW sweep termination + mark/scan assist time (GC performed in line with allocation) / background GC time / idle GC time + STW mark termination
11039->13278->6876 MB: heap size at GC start, at GC end, and live heap
14183 MB goal: goal heap size
8 P: number of processors used
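If capturing gctrace output is inconvenient, roughly the same numbers can be sampled from inside the program; a sketch using runtime.MemStats (the field names are from the standard library, the 5-second ticker is arbitrary):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    for range time.Tick(5 * time.Second) {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        // NumGC, GCCPUFraction, HeapAlloc and NextGC correspond loosely to the
        // GC number, the overall GC percentage, and the "live heap -> goal" figures above.
        fmt.Printf("gc %d: %.1f%% CPU in GC, live heap %d MB, next GC goal %d MB\n",
            m.NumGC, m.GCCPUFraction*100, m.HeapAlloc>>20, m.NextGC>>20)
    }
}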

Should the stats reported by Go's runtime.ReadMemStats approximately equal the resident memory set reported by ps aux?

In Go, should the "Sys" stat, or any other stat or combination reported by runtime.ReadMemStats, approximately equal the resident set size (RSS) reported by ps aux?
Alternatively, assuming some memory may be swapped out, should the Sys stat be approximately greater than or equal to the RSS?
We have a long-running web service that deals with a high frequency of requests and we are finding that the RSS quickly climbs up to consume virtually all of the 64GB memory on our servers. When it hits ~85% we begin to experience considerable degradation in our response times and in how many concurrent requests we can handle. The run I've listed below is after about 20 hours of execution, and is already at 51% memory usage.
I'm trying to determine if the likely cause is a memory leak (we make some calls to CGO). The data seems to indicate that it is, but before I go down that rabbit hole I want to rule out a fundamental misunderstanding of the statistics I'm using to make that call.
This is an amd64 build targeting linux and executing on CentOS.
Reported by runtime.ReadMemStats:
Alloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed
Sys: 3686471104 bytes (3515.69MB) // bytes obtained from system (sum of XxxSys below)
HeapAlloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed (same as Alloc above)
HeapSys: 3104931840 bytes (2961.09MB) // bytes obtained from system
HeapIdle: 1672339456 bytes (1594.87MB) // bytes in idle spans
HeapInuse: 1432592384 bytes (1366.23MB) // bytes in non-idle span
Reported by ps aux:
%CPU %MEM VSZ RSS
1362 51.3 306936436 33742120
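As a side note on cross-checking figures like these, here is a minimal sketch that prints the MemStats fields which usually explain most of the gap to RSS (the RSS value itself still has to come from the OS, e.g. ps or /proc/self/status; CGO and other non-Go mappings are invisible to MemStats):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    mb := func(b uint64) float64 { return float64(b) / (1 << 20) }

    fmt.Printf("Alloc        = %8.1f MB (live heap objects)\n", mb(m.Alloc))
    fmt.Printf("HeapInuse    = %8.1f MB (spans holding live objects)\n", mb(m.HeapInuse))
    fmt.Printf("HeapIdle     = %8.1f MB (spans with no objects)\n", mb(m.HeapIdle))
    fmt.Printf("HeapReleased = %8.1f MB (idle spans returned to the OS)\n", mb(m.HeapReleased))
    fmt.Printf("Sys          = %8.1f MB (total obtained from the OS)\n", mb(m.Sys))
    // Rough expectation: RSS is closer to Sys - HeapReleased than to Alloc,
    // and CGO allocations or swapped-out pages can move it in either direction.
}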

Resources