Reading Go gctrace output - go

I have gctrace output that looks like this:
gc 6 @48.155s 15%: 0.093+12360+0.32 ms clock, 0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
I am not sure how to read the CPU times in particular. I understand that it is broken down into three phases (STW sweep termination, concurrent mark/scan, and STW mark termination), but I'm not sure what the + signs mean (i.e. 0.18+7720 and 3615+0.65). What do these + signs signify?

The + signs separate the three phases, and inside the middle (mark/scan) phase the / signs separate assist, background, and idle time. In your case the CPU times are:
// CPU time
0.18ms : **STW** Sweep termination.
7720ms : Mark/Scan - Assist time (GC performed in line with allocation).
21356ms : Mark/Scan - Background GC time.
3615ms : Mark/Scan - Idle GC time.
0.65ms : **STW** Mark termination.
The format may change between Go versions; you can find more detailed info in the runtime package docs.
Currently, it is:
gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
where the fields are as follows:
gc #         the GC number, incremented at each GC
@#s          time in seconds since program start
#%           percentage of time spent in GC since program start
#+...+#      wall-clock/CPU times for the phases of the GC
#->#-># MB   heap size at GC start, at GC end, and live heap
# MB goal    goal heap size
# P          number of processors used
Example here
See also Interpreting GC trace output
gc 6 @48.155s 15%: 0.093+12360+0.32 ms clock, 0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
gc 6 : the GC number (the sixth collection)
@48.155s : 48.155 seconds since program start
15% : of time spent in GC since program start
0.093+12360+0.32 ms clock : wall-clock time for stop-the-world (STW) sweep termination + concurrent mark and scan + STW mark termination
0.18+7720/21356/3615+0.65 ms cpu : CPU time for the same three phases, with mark and scan broken down into assist time (GC performed in line with allocation), background GC time, and idle GC time
11039->13278->6876 MB : heap size at GC start, at GC end, and live heap
14183 MB goal : goal heap size
8 P : number of processors used
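To make the separators concrete, here is a minimal sketch that splits the cpu field first on + (the three phases) and then splits the middle phase on / (assist, background, idle). The type and function names (cpuTimes, parseCPU) are made up for illustration and are not part of the runtime package; the field layout is assumed to be the one quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuTimes holds the five CPU-time components of one gctrace line, in ms.
// (Hypothetical type, for illustration only.)
type cpuTimes struct {
	SweepTerm, Assist, Background, Idle, MarkTerm float64
}

// parseCPU splits a field such as "0.18+7720/21356/3615+0.65".
// '+' separates the three GC phases; '/' separates the three
// components of the concurrent mark/scan phase.
func parseCPU(field string) (cpuTimes, error) {
	phases := strings.Split(field, "+")
	if len(phases) != 3 {
		return cpuTimes{}, fmt.Errorf("want 3 '+'-separated phases, got %d", len(phases))
	}
	mark := strings.Split(phases[1], "/")
	if len(mark) != 3 {
		return cpuTimes{}, fmt.Errorf("want 3 '/'-separated mark times, got %d", len(mark))
	}
	raw := []string{phases[0], mark[0], mark[1], mark[2], phases[2]}
	vals := make([]float64, len(raw))
	for i, s := range raw {
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return cpuTimes{}, err
		}
		vals[i] = v
	}
	return cpuTimes{vals[0], vals[1], vals[2], vals[3], vals[4]}, nil
}

func main() {
	t, err := parseCPU("0.18+7720/21356/3615+0.65")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", t)
}
```

The same split-on-+ step applies to the clock field, which has no / part because wall-clock time is not broken down for the mark phase.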

Related

Julia Distributed slow down to half the single core performance when adding process

I've got a function func that takes ~50s when run on a single core. Now I want to run it many times on a server with 192 CPU cores. But when I add worker processes, say 180 of them, each core gets slower. The slowest worker takes ~100s to compute func.
Can someone help me, please?
Here is the pseudocode:
using Distributed
addprocs(180)
@everywhere include("func.jl") # defines func in every process
First try using only 10 workers
@sync @distributed for i in 1:10
    func()
end
@sync @distributed for i in 1:10
    @time func()
end
From worker #: 43.537886 seconds (243.58 M allocations: 30.004 GiB, 8.16% gc time)
From worker #: 44.242588 seconds (247.59 M allocations: 30.531 GiB, 7.90% gc time)
From worker #: 44.571170 seconds (246.26 M allocations: 30.338 GiB, 8.81% gc time)
...
From worker #: 45.259822 seconds (252.19 M allocations: 31.108 GiB, 8.25% gc time)
From worker #: 46.746692 seconds (246.36 M allocations: 30.346 GiB, 11.21% gc time)
From worker #: 47.451914 seconds (248.94 M allocations: 30.692 GiB, 8.96% gc time)
That seems fine with 10 workers.
Now we use 180 workers:
@sync @distributed for i in 1:180
    func()
end
@sync @distributed for i in 1:180
    @time func()
end
From worker #: 55.752026 seconds (245.20 M allocations: 30.207 GiB, 9.33% gc time)
From worker #: 57.031739 seconds (245.00 M allocations: 30.176 GiB, 7.70% gc time)
From worker #: 57.552505 seconds (247.76 M allocations: 30.543 GiB, 7.34% gc time)
...
From worker #: 96.850839 seconds (247.33 M allocations: 30.470 GiB, 7.95% gc time)
From worker #: 97.468060 seconds (250.04 M allocations: 30.827 GiB, 6.96% gc time)
From worker #: 98.078816 seconds (250.55 M allocations: 30.883 GiB, 10.87% gc time)
The time increases almost linearly from 55s to 100s.
I've checked with the top command that CPU usage does not seem to be the bottleneck ("id" stays >2%). Neither does RAM usage (~20% used).
Other version information:
Julia Version 1.5.3
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Update:
Substituting a minimal example (a simple for loop) for func does not change the slowdown.
Reducing the number of processes to 192/2 alleviates the slowdown.
The new pseudocode is:
addprocs(96)
@everywhere function ss()
    sum=0
    for i in 1:1000000000
        sum+=sin(i)
    end
end
@sync @distributed for i in 1:10
    ss()
end
@sync @distributed for i in 1:10
    @time ss()
end
From worker #: 32.8 seconds ..(8 others).. 34.0 seconds
...
@sync @distributed for i in 1:96
    @time ss()
end
From worker #: 38.1 seconds ..(94 others).. 45.4 seconds
You are measuring the time it takes each worker to perform func() and observe performance decrease for a single process when going from 10 processes to 180 parallel processes.
This looks quite normal to me:
Intel cores use hyper-threading, so you actually have 96 physical cores (in more detail: a hyper-threaded sibling adds only 20-30% performance). This means that 168 of your processes have to share 84 hyper-threaded cores, while 12 processes each get a full core.
The CPU clock speed is limited by thermal throttling (https://en.wikipedia.org/wiki/Thermal_design_power), and of course there is far more thermal headroom when running 10 processes than when running 180.
Your tasks are obviously competing for memory. Together they make over 5TB of memory allocations, and your machine has much less than that. The last mile in garbage collection is always the most expensive one, so if your garbage collectors are squeezed and competing for memory, the performance is uneven, with surprisingly long garbage collection times.
Looking at this data I would recommend you to try:
addprocs(192 ÷ 2)
and see how the performance is then going to change.
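The core-sharing arithmetic above (84 shared cores, 12 exclusive ones) can be reproduced with a small sketch. It assumes two hardware threads per physical core and that the OS spreads processes over physical cores before doubling up; the function name coreSharing is made up for illustration.

```go
package main

import "fmt"

// coreSharing reports how many physical cores end up hosting two
// processes (shared) and how many host a single process (exclusive).
// Assumption: 2 hardware threads per core, and processes are spread
// across physical cores before any core gets a second process.
func coreSharing(physicalCores, procs int) (shared, exclusive int) {
	switch {
	case procs <= physicalCores:
		return 0, procs
	case procs >= 2*physicalCores:
		return physicalCores, 0
	}
	shared = procs - physicalCores
	exclusive = physicalCores - shared
	return shared, exclusive
}

func main() {
	shared, exclusive := coreSharing(96, 180)
	fmt.Printf("%d processes share %d cores, %d processes get a full core\n",
		2*shared, shared, exclusive)
}
```

With 96 workers the same function gives no shared cores at all, which is consistent with the smaller slowdown reported in the update.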

Latency of accessing main memory is almost the same order of sending a packet

Looking at Jeff Dean's famous latency guides
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
One thing that looks somewhat uncanny to me is that reading 1 MB sequentially from disk is only roughly 10 times faster than sending a round-trip packet across the Atlantic (20 ms vs. 150 ms). Can anyone give me more intuition for why this feels right?
Q : 1MB SEQ-HDD-READ ~ 10x faster than a CA/NL trans-atlantic RTT - why this feels right?
Some "old" values ( with a few cross-QPI/NUMA updates from 2017 ) to start from:
           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2 CACHE reference
          71   ns - CPU cross-QPI/NUMA best case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - CPU own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1 KB with Zippy PROCESS (+GHz,+SIMD,+multicore tricks)
      20,000   ns - Send 2 KB over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|
Trans-Atlantic Network RTT :
Global optical networks work at roughly the speed of light ( 300,000,000 m/s ).
An LA(CA)-AMS(NL) packet does not travel the geodesic "distance", but over a set of continental and trans-atlantic "submarine" cables, whose total length is considerably longer ( see the map ).
These factors do not "improve": only the transport capacity grows, while the add-on latencies introduced by light-amplifiers, retiming units and other L1-PHY / L2- / L3-networking technologies are kept under control, as small as possible.
So the LA(CA)-AMS(NL) RTT will remain, using this technology, the same ~ 150 ms.
Using other technology, LEO-Sat Cubes as an example, the "distance" will only grow from ~ 9000 km P2P by a pair of additional GND/LEO segments plus a few additional LEO/LEO hops. These introduce a "longer" distance and add-on hop/hop re-processing latencies, and the capacity will not get anywhere close to the current optical transports, so no magic jump "back to the future" is to be expected ( we still miss the DeLorean ).
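As a rough sanity check of the RTT argument, the sketch below computes the propagation-only round-trip time for the ~ 9000 km point-to-point distance quoted above. The fibre speed of about two thirds of c is my own assumption (it is not stated above), and the real cable path is longer than the point-to-point distance, so these figures are lower bounds rather than predictions.

```go
package main

import "fmt"

// rttMillis returns the propagation-only round-trip time in ms
// for a one-way distance in km at a signal speed in km/s.
func rttMillis(distanceKm, speedKmPerSec float64) float64 {
	return 2 * distanceKm / speedKmPerSec * 1000
}

func main() {
	const distanceKm = 9000.0   // point-to-point figure quoted in the answer
	const vacuum = 300000.0     // km/s, speed of light in vacuum
	const fibre = vacuum * 2 / 3 // km/s, ASSUMPTION: typical speed inside optical fibre

	fmt.Printf("vacuum: %.0f ms\n", rttMillis(distanceKm, vacuum))
	fmt.Printf("fibre:  %.0f ms\n", rttMillis(distanceKm, fibre))
}
```

Under these assumptions propagation alone accounts for a large share of the ~ 150 ms; the rest would come from the longer cable route and the equipment along the way.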
The HDD Disk :
HDDs can have a very fast and very short transport path for moving the data, but the READ-ops have to wait for the physical / mechanical operations of the media-reading heads ( that takes most of the time here, not the actual data transfer to the host RAM ).
HDDs are rotational devices: the disk has to "align" to where the read starts, which costs the first ~ 10 [ms].
HDDs store data in a static structure of heads ( 2+, reading physical signals from the surfaces of the magnetic plates ) : cylinders ( concentric circular zones on the plate, onto which the disk-head micro-controller settles a cyl-aligned reading head ) : sectors ( angular sections of the cylinder, each carrying a block of same-sized data ~ 4KB, 8KB, ... ).
These factors do not "improve": all commodity drives remain at industry-selected angular speeds of about { 5k4 | 7k2 | 10k | 15k | 18k } spins/min (RPM). This means that if a well-compacted data layout is maintained on such a disk, one continuous head:cylinder-aligned read round the whole cylinder will take:
>>> [ 1E3 / ( RPM / 60. ) for RPM in ( 5400, 7200, 10000, 15000, 18000 ) ]
11.1 ms per CYL # 5k4 RPM disk,
8.3 ms per CYL # 7k2 RPM disk,
6.0 ms per CYL # 10k RPM disk,
4.0 ms per CYL # 15k RPM disk,
3.3 ms per CYL # 18k RPM disk.
Data density is also limited by the properties of the magnetic media. Spintronics R&D will bring somewhat more densely stored data, yet the last 30 years have been well inside the limits of reliable magnetic storage.
More is to be expected from a trick to co-parallel-read from several heads at once, yet this goes against the design of the embedded microcontrollers, so most of the reading goes sequentially, from one head after another, into the HDD controller's onboard buffers, best if no cyl-to-cyl mechanical re-alignment of the heads takes place. Technically this depends on the prior data-to-disc layout, maintained by the O/S and possibly by disk optimisers ( originally called disk-"compression", which just tried to re-align the known sequences of FAT-described data blocks so as to follow the most optimal trajectory of head:cyl:sector transitions, depending mostly on the actual device's head:head and cyl:cyl latencies ). So even the most optimistic data layout takes ~ 13..21 [ms] to seek-and-read just one head:cyl path.
Laws of Physics decide
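The ~ 13..21 [ms] range appears to be the ~ 10 ms seek plus one full revolution at the fastest and slowest spindle speeds listed above. A minimal sketch of that arithmetic, with the 10 ms seek taken from the table and the function name made up for illustration:

```go
package main

import "fmt"

// revolutionMillis returns the time of one full platter revolution in ms.
func revolutionMillis(rpm float64) float64 {
	return 60000 / rpm
}

func main() {
	const seekMillis = 10.0 // "DISK seek" figure from the latency table
	for _, rpm := range []float64{5400, 7200, 10000, 15000, 18000} {
		rev := revolutionMillis(rpm)
		fmt.Printf("%6.0f RPM: %4.1f ms per revolution, %4.1f ms seek + one revolution\n",
			rpm, rev, seekMillis+rev)
	}
}
```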
Some numbers from 2020.
Load from L1 is 4 cycles on Intel Coffee Lake and Ryzen (0.8nsec on a 5GHz CPU).
Load from memory is ~215 cycles on Intel Coffee Lake (43nsec on a 5GHz CPU). ~280 cycles on Ryzen.
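The cycle counts convert to nanoseconds by dividing by the clock frequency in GHz. A tiny sketch, where the 5 GHz clock is the one assumed in the text above:

```go
package main

import "fmt"

// cyclesToNanos converts a latency in CPU cycles to nanoseconds
// for a core running at the given clock frequency in GHz.
func cyclesToNanos(cycles, ghz float64) float64 {
	return cycles / ghz
}

func main() {
	const ghz = 5.0 // clock speed assumed in the text
	fmt.Printf("L1 load:     %.1f ns\n", cyclesToNanos(4, ghz))
	fmt.Printf("memory load: %.1f ns\n", cyclesToNanos(215, ghz))
}
```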

Golang Alloc and HeapAlloc vs pprof large discrepancies

I have a Go program that calculates large correlation matrices in memory. To do this I've set up a pipeline of 3 goroutines where the first reads in files, the second calculates the correlation matrix and the last stores the result to disk.
Problem is, when I run the program, the Go runtime allocates ~17GB of memory while a matrix only takes up ~2-3GB. Using runtime.ReadMemStats shows that the program is using ~17GB (and verified by using htop), but pprof only reports about ~2.3GB.
If I look at the mem stats after running one file through the pipeline:
var mem runtime.MemStats
runtime.ReadMemStats(&mem)
fmt.Printf("Total alloc: %d GB\n", mem.Alloc/1000/1000/1000)
This shows the total allocation of the program:
Total alloc: 17 GB
However, if I run go tool pprof mem.prof I get the following results:
(pprof) top5
Showing nodes accounting for 2.21GB, 100% of 2.21GB total
Showing top 5 nodes out of 9
flat flat% sum% cum cum%
1.20GB 54.07% 54.07% 1.20GB 54.07% dataset.(*Dataset).CalcCorrelationMatrix
1.02GB 45.93% 100% 1.02GB 45.93% bytes.makeSlice
0 0% 100% 1.02GB 45.93% bytes.(*Buffer).WriteByte
0 0% 100% 1.02GB 45.93% bytes.(*Buffer).grow
0 0% 100% 1.02GB 45.93% encoding/json.Indent
So I am wondering how I can find out why the program allocates 17 GB when the peak memory usage seems to be only ~2.5GB.
Is there a way to trace the memory usage throughout the program using pprof?
EDIT
I ran the program again with GODEBUG=gctrace=1 and got the following trace:
gc 1 @0.017s 0%: 0.005+0.55+0.003 ms clock, 0.022+0/0.47/0.11+0.012 ms cpu, 1227->1227->1226 MB, 1228 MB goal, 4 P
gc 2 @14.849s 0%: 0.003+1.7+0.004 ms clock, 0.015+0/1.6/0.11+0.018 ms cpu, 1227->1227->1227 MB, 2452 MB goal, 4 P
gc 3 @16.850s 0%: 0.006+60+0.003 ms clock, 0.027+0/0.46/59+0.015 ms cpu, 1876->1876->1712 MB, 2455 MB goal, 4 P
gc 4 @22.861s 0%: 0.005+238+0.003 ms clock, 0.021+0/0.46/237+0.015 ms cpu, 3657->3657->3171 MB, 3658 MB goal, 4 P
gc 5 @30.716s 0%: 0.005+476+0.004 ms clock, 0.022+0/0.44/476+0.017 ms cpu, 5764->5764->5116 MB, 6342 MB goal, 4 P
gc 6 @46.023s 0%: 0.005+949+0.004 ms clock, 0.020+0/0.47/949+0.017 ms cpu, 10302->10302->9005 MB, 10303 MB goal, 4 P
gc 7 @64.878s 0%: 0.006+382+0.004 ms clock, 0.024+0/0.46/382+0.019 ms cpu, 16548->16548->7728 MB, 18011 MB goal, 4 P
gc 8 @89.774s 0%: 0.86+2805+0.006 ms clock, 3.4+0/24/2784+0.025 ms cpu, 20208->20208->17088 MB, 20209 MB goal, 4 P
So it is quite obvious that the heap grows steadily through the program, but I am not able to pinpoint where. I've profiled memory usage using pprof.WriteHeapProfile after calling the memory intensive functions:
// memoryProfile writes a heap profile to <profpath>/mem.mprof.
// It needs the fmt, os, path and runtime/pprof packages.
func memoryProfile(profpath string) {
	// Create the profile directory (and any parents) if it is missing.
	if err := os.MkdirAll(profpath, os.ModePerm); err != nil {
		panic(err)
	}
	name := path.Join(profpath, "mem.mprof")
	f, err := os.Create(name)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fmt.Printf("Creating memory profile in %s\n", name)
	if err := pprof.WriteHeapProfile(f); err != nil {
		panic(err)
	}
}
As mentioned in the comments by JimB, the go profile is a sampling profiler and samples memory usage at certain intervals. In my case the sampling was not frequent enough to catch a function (JSON marshalling) that was using extensive amounts of memory.
Increasing the sample rate of the profiler by setting the environment variable
$ export GODEBUG=memprofilerate=1
will update runtime.MemProfileRate, and the profile now includes every allocated block.
A possible cause (as it was in my case) is that the binary was compiled with -race, which enables checking for race conditions.
The overhead for this is huge and will look like a massive memory leak if you check with htop or something similar, but it won't show up in any pprof output.

Should the stats reported by Go's runtime.ReadMemStats approximately equal the resident memory set reported by ps aux?

In Go, should the "Sys" stat or any other stat/combination reported by runtime.ReadMemStats approximately equal the resident memory set reported by ps aux?
Alternatively, assuming some memory may be swapped out, should the Sys stat be approximately greater than or equal to the RSS?
We have a long-running web service that deals with a high frequency of requests and we are finding that the RSS quickly climbs up to consume virtually all of the 64GB memory on our servers. When it hits ~85% we begin to experience considerable degradation in our response times and in how many concurrent requests we can handle. The run I've listed below is after about 20 hours of execution, and is already at 51% memory usage.
I'm trying to determine if the likely cause is a memory leak (we make some calls to CGO). The data seems to indicate that it is, but before I go down that rabbit hole I want to rule out a fundamental misunderstanding of the statistics I'm using to make that call.
This is an amd64 build targeting linux and executing on CentOS.
Reported by runtime.ReadMemStats:
Alloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed
Sys: 3686471104 bytes (3515.69MB) // bytes obtained from system (sum of XxxSys below)
HeapAlloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed (same as Alloc above)
HeapSys: 3104931840 bytes (2961.09MB) // bytes obtained from system
HeapIdle: 1672339456 bytes (1594.87MB) // bytes in idle spans
HeapInuse: 1432592384 bytes (1366.23MB) // bytes in non-idle span
Reported by ps aux:
%CPU %MEM       VSZ      RSS
1362 51.3 306936436 33742120
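To put the two sets of numbers side by side, here is a small sketch that converts them to the same unit. It assumes ps reports RSS in KiB, and it relies on the fact that runtime.MemStats only covers memory obtained by the Go runtime, so memory allocated by C code called through cgo would not appear in Sys. The helper name mib is made up for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// mib converts a byte count to MiB.
func mib(bytes uint64) float64 {
	return float64(bytes) / (1 << 20)
}

func main() {
	// Figures copied from the question.
	const rssKiB = 33742120     // RSS column of ps aux (assumed to be KiB)
	const sysBytes = 3686471104 // MemStats.Sys

	fmt.Printf("RSS: %.2f MiB\n", mib(rssKiB*1024))
	fmt.Printf("Sys: %.2f MiB\n", mib(sysBytes))
	fmt.Printf("not accounted for by the Go runtime: %.2f MiB\n",
		mib(rssKiB*1024)-mib(sysBytes))

	// The same fields read from the running process, for comparison.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("this process: Sys=%.2f MiB HeapReleased=%.2f MiB\n",
		mib(m.Sys), mib(m.HeapReleased))
}
```

A gap this large between RSS and Sys would point at memory outside the Go heap rather than at a misreading of the Go statistics, though that is only what the arithmetic suggests, not a diagnosis.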

How can i calculate for the estimated completion time of both process

A certain computer system runs in a multi-programming environment using a non-preemptive
algorithm. In this system, two processes A and B are stored in the process queue,
and A has a higher priority than B. The table below shows estimated execution time for each
process; for example, process A uses CPU, I/O, and then CPU sequentially for 30, 60, and 30
milliseconds respectively. Which of the following is the estimated time in milliseconds
to complete both A and B? Here, the multi-processing overhead of OS is negligibly
small. In addition, both CPU and I/O operations can be executed concurrently, but I/O
operations for A and B cannot be performed in parallel.
UNIT : millisecond
     CPU   I/O   CPU
A     30    60    30
B     45    45    --
Please help me. I need to explain this in front of the class tomorrow, but I can't seem to get the idea of it.
A has the highest priority, but since the system is non-preemptive, this is only a tiebreaker when both processes need a resource at the same time.
At t=0, A gets the CPU for 30 ms, B waits as it needs the CPU.
At t=30, A releases the CPU, B gets the CPU for 45 ms, while A gets the I/O for 60 ms.
At t=75, the CPU sits idle as B is waiting for A to finish I/O, and A is not ready to use the CPU.
At t=90, A releases I/O and gets the CPU for another 30 ms, while B gets the I/O for 45 ms.
At t=120, A releases the CPU and is finished.
At t=135, B releases I/O and is finished.
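The timeline above can be checked with a small simulation. This is a sketch under the stated rules only: one CPU, one I/O device, no preemption, and priority used purely as a tie-breaker. The type and function names (process, simulate) are made up for illustration.

```go
package main

import "fmt"

// process describes one job: bursts alternate CPU, I/O, CPU, ... (in ms).
type process struct {
	name     string
	priority int   // smaller value means higher priority
	bursts   []int // durations; even index uses the CPU, odd index uses I/O
	next     int   // index of the next burst to run
	ready    int   // time at which the previous burst finished
}

// simulate schedules bursts non-preemptively and returns the time
// at which the last process finishes.
func simulate(procs []*process) int {
	resources := [2]string{"CPU", "I/O"}
	free := [2]int{} // time at which each resource becomes free
	finish := 0
	for {
		// Pick the pending burst that can start earliest;
		// on a tie the higher-priority process goes first.
		var pick *process
		pickStart := 0
		for _, p := range procs {
			if p.next >= len(p.bursts) {
				continue // this process is done
			}
			start := p.ready
			if f := free[p.next%2]; f > start {
				start = f
			}
			if pick == nil || start < pickStart ||
				(start == pickStart && p.priority < pick.priority) {
				pick, pickStart = p, start
			}
		}
		if pick == nil {
			return finish
		}
		res := pick.next % 2
		end := pickStart + pick.bursts[pick.next]
		fmt.Printf("t=%3d..%3d  %s uses %s\n", pickStart, end, pick.name, resources[res])
		free[res], pick.ready = end, end
		pick.next++
		if end > finish {
			finish = end
		}
	}
}

func main() {
	total := simulate([]*process{
		{name: "A", priority: 0, bursts: []int{30, 60, 30}},
		{name: "B", priority: 1, bursts: []int{45, 45}},
	})
	fmt.Println("both finished at t =", total)
}
```

For the table in the question this ends at t = 135, matching the step-by-step walk-through; the idle CPU gap from t=75 to t=90 shows up as the missing CPU line in the printed schedule.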
It takes the longest path:
Non-preemptive multitasking, or cooperative multitasking, means that the processes share, e.g., the CPU time. In the worst case they need the longest time to complete their task.
CPU:
B = 45 is longer than A=30
45 +
I/O
A = 60 and B = 45
45 + 60
CPU again:
A = 30
45 + 60 + 30 = 135
I will explain in brief; please elaborate for your classroom discussion.
The answer is 135.
When process A waits for its I/O task, the CPU is given to process B. So the completion time for processes A and B would be
Process A (CPU) + Process A (I/O, during which Process B uses the CPU) + Process B (I/O)
30+60+45 = 135 ms
