golang CPU usage - go

I am aware of [1]. With a few lines of code, I just want to extract the current CPU usage of the top n processes with the highest CPU usage, more or less the top 5 rows of top. Using github.com/shirou/gopsutil/process this is straightforward:
// file: gotop.go
package main
import (
"log"
"time"
"sort"
"github.com/shirou/gopsutil/process"
)
type ProcInfo struct{
Name string
Usage float64
}
type ByUsage []ProcInfo
func (a ByUsage) Len() int { return len(a) }
func (a ByUsage) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a ByUsage) Less(i, j int) bool {
return a[i].Usage > a[j].Usage
}
func main() {
for {
processes, _ := process.Processes()
var procinfos []ProcInfo
for _, p := range processes{
a, _ := p.CPUPercent()
n, _ := p.Name()
procinfos = append(procinfos, ProcInfo{n, a})
}
sort.Sort(ByUsage(procinfos))
for _, p := range procinfos[:5]{
log.Printf(" %s -> %f", p.Name, p.Usage)
}
time.Sleep(3 * time.Second)
}
}
Although gotop refreshes every 3 seconds, just like top, it needs roughly 5 times as much CPU as top to obtain these values. Is there any trick to read the 5 most CPU-hungry processes more efficiently? I also tried to find the implementation of top to see how it is done there.
Is gopsutil responsible for this slowdown? I also found cpustat, which is implemented in Go as well. But even sudo ./cpustat -i 3000 -s 1 seems to be less efficient than top.
The main motivation is to monitor the usage of the current machine with a fairly small amount of computational effort so that it can run as a service in the background.
It seems, even htop is only reading /proc/stat.
Edit: as proposed in the comments, here is the result of profiling:
Showing top 10 nodes out of 46 (cum >= 70ms)
flat flat% sum% cum cum%
40ms 40.00% 40.00% 40ms 40.00% syscall.Syscall
10ms 10.00% 50.00% 30ms 30.00% github.com/shirou/gopsutil/process.(*Process).fillFromStatusWithContext
10ms 10.00% 60.00% 30ms 30.00% io/ioutil.ReadFile
10ms 10.00% 70.00% 10ms 10.00% runtime.slicebytetostring
10ms 10.00% 80.00% 20ms 20.00% strings.FieldsFunc
10ms 10.00% 90.00% 10ms 10.00% syscall.Syscall6
10ms 10.00% 100% 10ms 10.00% unicode.IsSpace
0 0% 100% 10ms 10.00% bytes.(*Buffer).ReadFrom
0 0% 100% 70ms 70.00% github.com/shirou/gopsutil/process.(*Process).CPUPercent
0 0% 100% 70ms 70.00% github.com/shirou/gopsutil/process.(*Process).CPUPercentWithContext
Seems like the syscall takes forever. A tree dump is here:
https://gist.github.com/PatWie/4fa528b7d7b1d0b5c1b665c056671477
This changes the question into:
- Is the syscall the issue?
- Are there any C sources for the top program? I only found the implementation of htop.
- Is there an easy fix? I am considering writing it in C and just wrapping it for Go.

github.com/shirou/gopsutil/process uses ioutil.ReadFile, which accesses the filesystem less efficiently than top. In particular, ReadFile:
- calls Stat, which adds an extra, unnecessary syscall.
- uses os.Open instead of unix.Openat + os.NewFile, which costs extra kernel time traversing /proc when resolving the path. os.NewFile is still a little inefficient since it always checks whether the file descriptor is non-blocking. This can be avoided by using the golang.org/x/sys/unix or syscall packages directly.
Retrieving process details under Linux is fairly inefficient in general (lots of filesystem scanning, marshalling text data). However, you can achieve similar performance to top with Go by fixing the filesystem access (as described above).

Related

Go CountBits exercise: n strings slices performing a lot better than log2n math package calls

I'm trying to solve an exercise where you have to count the number of '1' bits of any number.
I came up with 4 main ideas:
1. Make a recursive function (countBitsString) that takes the base-2 string of the number and strips the first character at each step. If that character is 1, increment a counter and continue.
2. The same as above, but with a pointer to that string instead of the value (TODO)
3. Use logarithms to find how many powers of 2 are inside the number; each of them corresponds to a '1' bit, so increment the counter. Increment once more if the number is odd (countBitsSmart)
4. Work with binary logic to check the bit and then shift left (TODO)
Now, here's my current code and later the results
package main
import (
"fmt"
"strconv"
"math/rand"
"math"
"time"
)
/*
* implement a function that counts the number of set of bits in binary representation
* Complete the 'countBits' function below.
* The function is expected to return an int32.
* The function accepts unit32 num as parameter.
*/
func countBitsString(num uint32) int32 {
num_64 := int64(num)
num_2 := strconv.FormatInt(num_64, 2)
_, counter := recursiveCount(num_2, 0)
return counter
}
func recursiveCount(s string, counter int32) (string, int32) {
if len(s) < 1 {
return "", counter
}
if s[:1] == "1" {
counter+=1
}
//fmt.Printf("Current counter: %d\n", counter)
//fmt.Printf("Current char %s\n", s[:1])
return recursiveCount(s[1:], counter)
}
func countBitsBinary(num uint32) int32 {
fmt.Println("TODO")
// convert uint23 in binary form
// for binary_number > 0
// if current bit == 1 => counter++
// shift left binary_number
return 0
}
func countBitsSmart(num uint32) int32{
var upBits int32
var highestPower uint32 = math.MaxUint32
if num % 2 >0 {
upBits+=1
// fmt.Printf("\tUP bits = %d\n", upBits)
}
for ; (num > 2 && highestPower > 1); upBits++ {
highestPower = uint32(math.Log2(float64(num)))
num = num - uint32(math.Pow(2, float64(highestPower)))
// fmt.Printf("\tlog2 = %d\n", highestPower)
// fmt.Printf("\tnum %d rest %d\n", num, num%2)
// fmt.Printf("\tUP bits = %d\n", upBits)
}
return upBits
}
// Profiling with execution time and calling
func invoker(numInput []uint32, funcName string, countFunc func(num uint32) int32) {
fmt.Println("\n=========================================================")
start := time.Now()
for _, numTemp := range numInput{
// fmt.Printf("> Number: %d (%s)\tUp bits: %d\n", numTemp, strconv.FormatInt(int64(numTemp), 2), countFunc( uint32(numTemp) ))
countFunc( uint32(numTemp))
}
fmt.Printf("Time elapsed for %v func= %vs", funcName, time.Since(start))
}
func main() {
const testSize = 10000000
var numInput = make([]uint32, testSize)
for i:=0; i<testSize; i++ {
numInput[i] = uint32(rand.Intn(500))
}
fmt.Printf("\n Test size = %d\n", testSize)
invoker(numInput, "countBitsString", countBitsString)
invoker(numInput, "countBitsSmart", countBitsSmart)
fmt.Println()
}
How can the second function perform 5 times worse than the first one?
Is it just because of the calls to the math package?
Thanks
Using floating point Log2/Pow calculations is a very expensive way to count bits. Converting to a binary string is much faster since it can be done with bit shifts/integer calculations (though it's still expensive). The Wikipedia Hamming weight page explains some very fast popcount calculations via bitmasks.
You can see the relative performance via Go's builtin benchmark support. It can help understand how code performs:
// main_test.go
package main
import (
"math/bits"
"testing"
)
// Simple benchmarks to get started.
// You could also try benchmarking different numbers to understand
// performance differences. There is some inherent bias in benchmarking
// sequential numbers starting from 0. You could try a list of 256 preset
// random numbers. There are many options..
func BenchmarkCountBitsString(b *testing.B) {
for i := 0; i < b.N; i++ {
dummyInt32 = countBitsString(uint32(i))
}
}
func BenchmarkCountBitsSmart(b *testing.B) {
for i := 0; i < b.N; i++ {
dummyInt32 = countBitsSmart(uint32(i))
}
}
func BenchmarkOnesCount(b *testing.B) {
for i := 0; i < b.N; i++ {
dummyInt = bits.OnesCount32(uint32(i))
}
}
// Ensure some optimisations don't occur by assigning to a global.
var (
dummyInt int
dummyInt32 int32
)
Then you can easily understand the relative performance:
$ go test -bench .
goos: linux
goarch: amd64
pkg: stack/bench
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkCountBitsString-8 10388419 99.97 ns/op
BenchmarkCountBitsSmart-8 2036588 617.6 ns/op
BenchmarkOnesCount-8 1000000000 0.5540 ns/op
PASS
ok stack/bench 3.625s
Inspecting performance with pprof requires some extra parameters:
- Disable tests with -run XXX (XXX doesn't match any tests).
- Limit to a single function benchmark, otherwise the faster function will be run more often until it consumes as much time as the slower function (~5s in the final run).
- Use a fixed number of iterations for the benchmark time so the profiles are comparable between functions.
- Save the CPU profile with -cpuprofile. This also causes the test binary to be kept as DIR.test (bench.test in this example).
Benchmark details for countBitsSmart show it spends ~55% time in math.Log2 and ~38% time in math.Pow:
$ go test -bench CountBitsSmart -run XXX -benchtime 10000000x -cpuprofile cpu.prof
goos: linux
goarch: amd64
pkg: stack/bench
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkCountBitsSmart-8 10000000 703.0 ns/op
PASS
ok stack/bench 7.226s
$ go tool pprof -top -cum bench.test cpu.prof
File: bench.test
Type: cpu
Time: Apr 17, 2022 at 11:19pm (AEST)
Duration: 7.22s, Total samples = 7.01s (97.10%)
Showing nodes accounting for 7s, 99.86% of 7.01s total
Dropped 1 node (cum <= 0.04s)
flat flat% sum% cum cum%
0.01s 0.14% 0.14% 7.01s 100% stack/bench.BenchmarkCountBitsSmart
0 0% 0.14% 7.01s 100% testing.(*B).launch
0 0% 0.14% 7.01s 100% testing.(*B).runN
0.43s 6.13% 6.28% 7s 99.86% stack/bench.countBitsSmart
0.07s 1% 7.28% 3.88s 55.35% math.Log2 (inline)
0.42s 5.99% 13.27% 3.81s 54.35% math.log2
0.02s 0.29% 13.55% 3.07s 43.79% math.Log (inline)
3.05s 43.51% 57.06% 3.05s 43.51% math.archLog
0.15s 2.14% 59.20% 2.69s 38.37% math.Pow (inline)
1.20s 17.12% 76.32% 2.54s 36.23% math.pow
0.07s 1% 77.32% 0.57s 8.13% math.Frexp (inline)
[...]
$ go tool pprof -list bench.countBitsSmart bench.test cpu.prof
Total: 7.01s
ROUTINE ======================== stack/bench.countBitsSmart in /home/.../stack/bench/main.go
430ms 7s (flat, cum) 99.86% of Total
. . 54: if num%2 > 0 {
. . 55: upBits += 1
. . 56: // fmt.Printf("\tUP bits = %d\n", upBits)
. . 57: }
. . 58:
40ms 40ms 59: for ; num > 2 && highestPower > 1; upBits++ {
250ms 4.13s 60: highestPower = uint32(math.Log2(float64(num)))
90ms 2.78s 61: num = num - uint32(math.Pow(2, float64(highestPower)))
. . 62: // fmt.Printf("\tlog2 = %d\n", highestPower)
. . 63: // fmt.Printf("\tnum %d rest %d\n", num, num%2)
. . 64: // fmt.Printf("\tUP bits = %d\n", upBits)
. . 65: }
. . 66:
50ms 50ms 67: return upBits
. . 68:}
. . 69:
. . 70:// Profiling with execution time and calling
. . 71:func invoker(numInput []uint32, funcName string, countFunc func(num uint32) int32) {
. . 72: fmt.Println("\n=========================================================")
The countBitsString benchmark shows about half the time is spent in recursiveCount and half in strconv.FormatInt. This indicates recursiveCount could be a good target for optimisation.
$ go test -bench CountBitsString -run XXX -benchtime 10000000x -cpuprofile cpu.prof
goos: linux
goarch: amd64
pkg: stack/bench
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkCountBitsString-8 10000000 99.13 ns/op
PASS
ok stack/bench 1.111s
$ go tool pprof -top -cum bench.test cpu.prof
File: bench.test
Type: cpu
Time: Apr 17, 2022 at 11:15pm (AEST)
Duration: 1.10s, Total samples = 1.05s (95.18%)
Showing nodes accounting for 1.05s, 100% of 1.05s total
flat flat% sum% cum cum%
0.02s 1.90% 1.90% 1.01s 96.19% stack/bench.BenchmarkCountBitsString
0 0% 1.90% 1.01s 96.19% testing.(*B).launch
0 0% 1.90% 1.01s 96.19% testing.(*B).runN
0.01s 0.95% 2.86% 0.99s 94.29% stack/bench.countBitsString
0.51s 48.57% 51.43% 0.51s 48.57% stack/bench.recursiveCount
0 0% 51.43% 0.47s 44.76% strconv.FormatInt
0.21s 20.00% 71.43% 0.47s 44.76% strconv.formatBits
0.03s 2.86% 74.29% 0.26s 24.76% runtime.slicebytetostring
0.14s 13.33% 87.62% 0.22s 20.95% runtime.mallocgc
0.04s 3.81% 91.43% 0.04s 3.81% runtime.nextFreeFast (inline)
[...]
$ go tool pprof -list bench.countBitsString bench.test cpu.prof
Total: 1.05s
ROUTINE ======================== stack/bench.countBitsString in /home/.../stack/bench/main.go
10ms 990ms (flat, cum) 94.29% of Total
. . 14: * implement a function that counts the number of set of bits in binary representation
. . 15: * Complete the 'countBits' function below.
. . 16: * The function is expected to return an int32.
. . 17: * The function accepts unit32 num as parameter.
. . 18: */
10ms 10ms 19:func countBitsString(num uint32) int32 {
. . 20:
. . 21: num_64 := int64(num)
. 470ms 22: num_2 := strconv.FormatInt(num_64, 2)
. 510ms 23: _, counter := recursiveCount(num_2, 0)
. . 24: return counter
. . 25:}
. . 26:
. . 27:func recursiveCount(s string, counter int32) (string, int32) {
. . 28: if len(s) < 1 {
These benchmarks show math.Log2+math.Pow (6.57s) is ~14x slower than strconv.FormatInt (0.47s).
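Since the profile attributes about half of countBitsString's time to recursiveCount, one possible optimisation is to keep the string conversion but drop the recursion. The sketch below is only an illustration; countBitsStringIter is a hypothetical name, and the effect should be measured with the same benchmarks before drawing conclusions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// countBitsStringIter keeps the string-based idea from the question but
// counts the '1' digits in a single pass instead of recursing.
func countBitsStringIter(num uint32) int32 {
	s := strconv.FormatUint(uint64(num), 2)
	return int32(strings.Count(s, "1"))
}

func main() {
	fmt.Println(countBitsStringIter(5)) // 5 is 101 in binary, so this prints 2
}
```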
I also highly recommend using the pprof web interface to interactively explore performance. This is a little non-obvious, but can be started with:
$ go tool pprof -http : bench.test cpu.prof
Dave Cheney also has an instructive blog on benchmarking with Go:
https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

Disable array/slice bounds checking in Golang to improve performance

I'm writing a NES/Famicom emulator. I register a callback function that is called every time a pixel is rendered, which means it is called about 3.7 million times per second (256 width * 240 height * 60 fps).
In my callback function there are many array/slice operations, and I found that Go does bounds checking every time I index an element. But the indexes are the results of bitwise AND operations, so I can tell that they will NOT exceed either bound.
So, I'm here to ask if there is a way to disable bounds checking?
Thank you.
Using gcflags you can disable bounds checking.
go build -gcflags=-B .
If you really need to avoid the bounds check, you can use the unsafe package and use C-style pointer arithmetic to perform your lookups:
index := 2
size := unsafe.Sizeof(YourStruct{})
p := unsafe.Pointer(&yourStructSlice[0])
indexp := (unsafe.Pointer)(uintptr(p) + size*uintptr(index))
yourStructPtr := (*YourStruct)(indexp)
https://play.golang.org/p/GDNphKsJPOv
You should time it to determine how much CPU run time you are actually saving by doing this, but it is probably possible to make it faster using this approach.
Also, you may want to have a look at the actual generated instructions to make sure that what you are outputting is actually more efficient. Doing lookups without bounds checks may very well be more trouble than it's worth. Some info on how to do that here: https://github.com/teh-cmc/go-internals/blob/master/chapter1_assembly_primer/README.md
Another common approach is to write performance critical code in assembly (see https://golang.org/doc/asm). Ain't no automatic bounds checking in asm :)
The XY Problem
The XY problem is asking about your attempted solution rather than
your actual problem.
Your real problem is overall performance. Let's see some benchmarks that show bounds checking is a significant problem. It may not be one. For example, in the benchmark below, where one iteration renders 60 frames, removing the bounds check saves less than one millisecond per second:
Bounds check:
BenchmarkPixels-4 300 4034580 ns/op
No bounds check:
BenchmarkPixels-4 500 3150985 ns/op
bounds_test.go:
package main
import (
"testing"
)
const (
width = 256
height = 240
frames = 60
)
var pixels [width * height]byte
func writePixel(w, h int) {
pixels[w*height+h] = 42
}
func BenchmarkPixels(b *testing.B) {
for N := 0; N < b.N; N++ {
for f := 0; f < frames; f++ {
for w := 0; w < width; w++ {
for h := 0; h < height; h++ {
writePixel(w, h)
}
}
}
}
}
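Since the question says the indexes come from bitwise AND operations, there may be a middle ground that needs neither -B nor unsafe. The sketch below assumes the lookup table can be declared as a fixed-size array whose length is a power of two (an assumption; the question does not show its data structures). The names palette, size and lookup are hypothetical. With a constant array length and a constant mask, the compiler can usually prove the index is in range and omit the check; building with -gcflags="-d=ssa/check_bce/debug=1" reports the bounds checks that remain.

```go
package main

import "fmt"

const size = 256 // power of two, so i&(size-1) always lies in [0, size)

// palette is a fixed-size array: its length is known at compile time.
var palette [size]byte

// lookup masks the index with size-1 before indexing, which keeps the
// access in range for any value of i.
func lookup(i int) byte {
	return palette[i&(size-1)]
}

func main() {
	for i := range palette {
		palette[i] = byte(i)
	}
	fmt.Println(lookup(3), lookup(259)) // 259&255 == 3, so this prints "3 3"
}
```

The same masking on a slice (rather than an array) generally does not help, because the slice length is not a compile-time constant.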

How to detect what is preventing multiple cores being used in golang?

So, I have a piece of code that is concurrent and is meant to run on each CPU/core.
There are two large vectors with input/output values
var (
input = make([]float64, rowCount)
output = make([]float64, rowCount)
)
These are filled, and I want to compute the distance (error) between each input-output pair. Since the pairs are independent, a possible concurrent version is the following:
var d float64 // Error to be computed
// Setup a worker "for each CPU"
ch := make(chan float64)
nw := runtime.NumCPU()
for w := 0; w < nw; w++ {
go func(id int) {
var wd float64
// eg nw = 4
// worker0, i = 0, 4, 8, 12...
// worker1, i = 1, 5, 9, 13...
// worker2, i = 2, 6, 10, 14...
// worker3, i = 3, 7, 11, 15...
for i := id; i < rowCount; i += nw {
res := compute(input[i])
wd += distance(res, output[i])
}
ch <- wd
}(w)
}
// Compute total distance
for w := 0; w < nw; w++ {
d += <-ch
}
The idea is to have a single worker for each CPU/core, and each worker processes a subset of the rows.
The problem I'm having is that this code is no faster than the serial code.
Now, I'm using Go 1.7, so runtime.GOMAXPROCS should already be set to runtime.NumCPU(), but even setting it explicitly does not improve performance.
distance is just (a-b)*(a-b);
compute is a bit more complex, but should be reentrant and use global data only for reading (and uses math.Pow and math.Sqrt functions);
no other goroutine is running.
So, besides accessing the global data (input/output) for reading, there are no locks/mutexes that I am aware of (not using math/rand, for example).
I also compiled with -race and nothing emerged.
My host has 4 virtual cores, but when I run this code, htop shows CPU usage of about 102%. I expected something around 380%, as happened in the past with other Go code that used all the cores.
I would like to investigate, but I don't know how the runtime allocates threads and schedules goroutines.
How can I debug this kind of issue? Can pprof help me in this case? What about the runtime package?
Thanks in advance
Sorry, but in the end I got the measurement wrong. @JimB was right, and I had a minor leak, but not enough to justify a slowdown of this magnitude.
My expectations were too high: the function I was making concurrent was called only at the beginning of the program, so the overall performance improvement was minor.
After applying the pattern to other sections of the program, I got the expected results. My mistake was in evaluating which section was the most important.
Anyway, I learned a lot of interesting things meanwhile, so thanks a lot to all the people trying to help!
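The effect described in this answer is what Amdahl's law predicts: if only a fraction p of the total run time is parallelised over n workers, the overall speedup is at most 1 / ((1 - p) + p/n). The sketch below just evaluates that formula; the fractions are illustrative values, not measurements from the program in the question.

```go
package main

import "fmt"

// amdahlSpeedup returns the overall speedup when a fraction p of the total
// run time is parallelised perfectly across n workers.
func amdahlSpeedup(p float64, n int) float64 {
	return 1 / ((1 - p) + p/float64(n))
}

func main() {
	const workers = 4
	// Illustrative fractions of total run time spent in the parallel section.
	for _, p := range []float64{0.05, 0.50, 0.95} {
		fmt.Printf("parallel fraction %.2f -> overall speedup %.2fx\n", p, amdahlSpeedup(p, workers))
	}
}
```

With 4 workers, a section that accounts for 5% of the run time yields only about a 1.04x overall speedup, which is why profiling the whole program first (for example with pprof) helps pick the section worth parallelising.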

golang slice allocation performance

I stumbled upon an interesting thing while checking the performance of memory allocation in Go.
package main
import (
"fmt"
"time"
)
func main(){
const alloc int = 65536
now := time.Now()
loop := 50000
for i := 0; i<loop;i++{
sl := make([]byte, alloc)
i += len(sl) * 0
}
elapsed := time.Since(now)
fmt.Printf("took %s to allocate %d bytes %d times", elapsed, alloc, loop)
}
I am running this on a Core-i7 2600 with go version 1.6 64bit (also same results on 32bit) and 16GB of RAM (on WINDOWS 10)
so when alloc is 65536 (exactly 64K) it runs for 30 seconds (!!!!).
When alloc is 65535 it takes ~200ms.
Can someone explain this to me please?
I tried the same code at home with my Core i7-920 @ 3.8GHz, but it didn't show the same results (both took around 200ms). Does anyone have an idea what's going on?
Setting GOGC=off improved performance (down to less than 100ms). Why?
Because of escape analysis. When you build with go build -gcflags -m, the compiler prints which allocations escape to the heap. It really depends on your machine and Go compiler version, but when the compiler decides that an allocation should move to the heap, it means 2 things:
1. the allocation will take longer (since "allocating" on the stack is just 1 cpu instruction)
2. the GC will have to clean up that memory later - costing more CPU time
On my machine, the allocation of 65536 bytes escapes to the heap and the allocation of 65535 bytes doesn't.
That's why 1 byte changed the whole process from 200ms to 30s. Amazing.
Note/Update 2021: as Tapir Liu notes in Go101 with this tweet:
As of Go 1.17, Go runtime will allocate the elements of slice x on stack if the compiler proves they are only used in the current goroutine and N <= 64KB:
var x = make([]byte, N)
And Go runtime will allocate the array y on stack if the compiler proves it is only used in the current goroutine and N <= 10MB:
var y [N]byte
Then how to allocated (the elements of) a slice which size is larger than 64KB but not larger than 10MB on stack (and the slice is only used in one goroutine)?
Just use the following way:
var y [N]byte
var x = y[:]
Considering stack allocation is faster than heap allocation, that would have a direct effect on your test for alloc equal to 65536 or more.
Tapir adds:
In fact, we could allocate slices with arbitrary sum element sizes on stack.
const N = 500 * 1024 * 1024 // 500M
var v byte = 123
func createSlice() byte {
var s = []byte{N: 0}
for i := range s { s[i] = v }
return s[v]
}
Changing 500 to 512 make program crash.
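The reason is very simple: compare the stack frame size of main in the generated assembly.
const alloc int = 65535
0x0000 00000 (example.go:8) TEXT "".main(SB), ABIInternal, $65784-0
const alloc int = 65536
0x0000 00000 (example.go:8) TEXT "".main(SB), ABIInternal, $248-0
The difference is where the slice is created. With 65535 the frame size ($65784) includes the slice's backing array, so it lives on the stack; with 65536 the frame is small ($248) because the backing array is allocated on the heap.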

Program execution taking almost same usertime on CPU as well as GPU?

The program for finding prime numbers using OpenCL 1.1 gave the following benchmarks :
Device : CPU
Realtime : approx. 3 sec
Usertime : approx. 32 sec
Device : GPU
Realtime : approx. 37 sec
Usertime : approx. 32 sec
Why is the usertime of execution on the GPU not less than that on the CPU? Is data/task parallelization not occurring?
System specifications: 64-bit CentOS 5.3 system with two ATI Radeon 5970 graphics cards + Intel Core i7 processor (12 cores)
Your kernel is rather inefficient; I have an adjusted one below for you to consider. As to why it runs better on a CPU device...
Using your algorithm, the work items take varying amounts of time to execute, and they take longer as the numbers tested grow larger. A work group on a GPU will not finish until all of its items are finished, so some of the hardware is left idle until the last item is done. On a CPU, it behaves more like a loop iterating over the kernel items, so the difference in cycles needed to compute each item won't drastically affect the performance.
'A' is not used by the kernel. It should not be copied unless it is used. It looks like you wanted to test A[i] rather than 'i' itself, though.
I think the GPU would be much better at FFT-based prime calculations, or even a sieve algorithm.
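{
int t;
int i = get_global_id(0);
int end = sqrt(i);
if (i % 2 == 0) {
B[i] = (i == 2); // even numbers other than 2 are not prime
} else {
B[i] = (i > 1); // odd candidates start non-zero (assuming only that it should be non-zero); 1 is not prime
}
// test odd divisors only, and stop as soon as one is found
for ( t = 3; (t<=end)&&(B[i] > 0) ; t+=2 ) {
if ( i % t == 0 ) {
B[ i ] = 0;
}
}
}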
