What does runtime.memclrNoHeapPointers do? - go

I am profiling a library and see that a function called runtime.memclrNoHeapPointers is taking up about 0.82 seconds of the CPU time.
What does this function do, and what does it tell me about the memory usage of the library I am profiling?
The output, for completeness:
File: gribtest.test
Type: cpu
Time: Feb 12, 2019 at 8:27pm (CET)
Duration: 5.21s, Total samples = 5.11s (98.15%)
Showing nodes accounting for 4.94s, 96.67% of 5.11s total
Dropped 61 nodes (cum <= 0.03s)
flat flat% sum% cum cum%
1.60s 31.31% 31.31% 1.81s 35.42% github.com/nilsmagnus/grib/griblib.(*BitReader).readBit
1.08s 21.14% 52.45% 2.89s 56.56% github.com/nilsmagnus/grib/griblib.(*BitReader).readUint
0.37s 7.24% 59.69% 0.82s 16.05% encoding/binary.(*decoder).value
0.35s 6.85% 66.54% 0.35s 6.85% runtime.memclrNoHeapPointers

func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr)
memclrNoHeapPointers clears n bytes starting at ptr.
Usually you should use typedmemclr. memclrNoHeapPointers should be
used only when the caller knows that *ptr contains no heap pointers
because either:
*ptr is initialized memory and its type is pointer-free.
*ptr is uninitialized memory (e.g., memory that's being reused
for a new allocation) and hence contains only "junk".
(implemented in memclr_*.s; the declaration is marked //go:noescape)
See https://github.com/golang/go/blob/9e277f7d554455e16ba3762541c53e9bfc1d8188/src/runtime/stubs.go#L78
This is part of the garbage collector. You can see the declaration at the link above.
The specifics of what it does are CPU dependent: see the various memclr_*.s files in the runtime for the implementations.
This does seem like a long time in the GC, but it's hard to say much about the library's memory usage from just the data you've shown.
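For intuition, here is a minimal, hypothetical snippet of the kind of code that tends to show up as runtime.memclrNoHeapPointers in a CPU profile; it is only an illustration, not code from the grib library:

func allocateAndReset() []byte {
    // make must return zeroed memory; for a []byte (a pointer-free type) the
    // runtime clears reused spans with memclrNoHeapPointers.
    buf := make([]byte, 1<<20)

    // The compiler also recognizes this zeroing loop and lowers it to a single
    // memclr call when the element type contains no pointers.
    for i := range buf {
        buf[i] = 0
    }
    return buf
}

In practice, a noticeable memclrNoHeapPointers figure usually accompanies heavy allocation (or explicit re-zeroing) of large pointer-free values such as byte slices, rather than pointing at any one specific function in your own code.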

Related

How to maximize data transfer speed over USB (configured as virtual com port)

I am having trouble getting my streaming to work over OTG-USB-FS configured as VCP. At my disposal I have a nucleo-h743zi board that seems to be doing a good job of sending data, but on the PC side I have a problem receiving that data.
for(;;) {
#define number_of_ccr 1024
    unsigned int lpBuffer[number_of_ccr] = {0};
    unsigned long nNumberOfBytesToRead = number_of_ccr * 4;
    unsigned long lpNumberOfBytesRead;

    QueryPerformanceCounter(&startCounter);
    ReadFile(
        hSerial,
        lpBuffer,
        nNumberOfBytesToRead,
        &lpNumberOfBytesRead,
        NULL
    );

    // lpBuffer holds raw bytes; compare it as a C string
    if(!strcmp((const char *)lpBuffer, "end\r\n")) {
        CloseHandle(FileHandle);
        fprintf(stderr, "end flag was received\n");
        break;
    }
    else if(lpNumberOfBytesRead > 0) {
        // NOTE(): succeed
        QueryPerformanceCounter(&endCounter);
        time = Win32GetSecondsElapsed(startCounter, endCounter);

        char *copyString = "copy";
        WriteFile(hSerial, copyString, strlen(copyString), &bytes_written, NULL);

        DWORD BytesWritten;
        // write data to file
        WriteFile(FileHandle, lpBuffer, nNumberOfBytesToRead, &BytesWritten, 0);
    }
}
QPC shows that one successful data block transfer (1024*4 bytes) takes 0.00733297970 seconds.
This is the listener code. I suspect this is not how it should be done, so I'm here to seek advice. I was hoping that full streaming without control sequences ("copy") would be possible, but in that case I can't receive adjacent data (within one transfer block it's okay, but two consecutive received blocks aren't adjacent).
Example:
block_1: 1 2 3 4 5 6
block_2: 13 14 15 16 17 18
Is there any way to speed up my receiving?
(I tried the /O2 optimization switch without any success.)
You need to configure a buffer on the PC side that is 2 or 3 times the size of the buffer you are transferring from your board, and use something like a double-buffer scheme for transferring the data: you transmit the first buffer while filling the second, then alternate.
It is also a good idea to activate the caches and to place the buffers in the fast memory of the stm32h7 (the D1-domain RAM).
But if your interface simply cannot match the speed you need, no trick will get around that. Except maybe one: if your controller is fast enough, you can implement lossless data compression and transfer the compressed data. If you transmit low-entropy data, this could give you a solid boost in speed.
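To make the double-buffer idea more concrete, here is a small sketch. It is written in Go purely for brevity (the pattern itself is language independent), and readFromDevice and the output file name are placeholders rather than the Win32 calls from the question: one goroutine keeps reading fixed-size blocks from the device while another drains completed blocks to disk, so the reader never stalls on file I/O.

package main

import "os"

const blockSize = 1024 * 4 // same block size as in the question

// readFromDevice is a placeholder for the actual serial read (e.g. ReadFile on the COM handle).
func readFromDevice(buf []byte) {}

func main() {
    full := make(chan []byte, 2)  // blocks ready to be written to disk
    empty := make(chan []byte, 2) // blocks ready to be reused by the reader

    empty <- make([]byte, blockSize)
    empty <- make([]byte, blockSize)

    // Reader: fill buffers from the device as fast as they become available.
    go func() {
        for {
            buf := <-empty
            readFromDevice(buf)
            full <- buf
        }
    }()

    // Writer: drain completed buffers to disk and hand them back to the reader.
    out, err := os.Create("capture.bin")
    if err != nil {
        return
    }
    defer out.Close()
    for buf := range full {
        out.Write(buf)
        empty <- buf
    }
}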

pprof (for golang) doesn't show details for my package

I've been trying to profile my go application (evm-specification-miner) with pprof, but the output is not really useful:
(pprof) top5
108.59mins of 109.29mins total (99.36%)
Dropped 607 nodes (cum <= 0.55mins)
Showing top 5 nodes out of 103 (cum >= 0.98mins)
flat flat% sum% cum cum%
107.83mins 98.66% 98.66% 108.64mins 99.40% [evm-specification-miner]
0.36mins 0.33% 98.99% 6mins 5.49% net.dialIP
0.30mins 0.28% 99.27% 4.18mins 3.83% net.listenIP
0.06mins 0.052% 99.32% 34.66mins 31.71% github.com/urfave/cli.BoolFlag.ApplyWithError
0.04mins 0.036% 99.36% 0.98mins 0.9% net.probeIPv6Stack
And here is the cumulative output:
(pprof) top5 --cum
1.80hrs of 1.82hrs total (98.66%)
Dropped 607 nodes (cum <= 0.01hrs)
Showing top 5 nodes out of 103 (cum >= 1.53hrs)
flat flat% sum% cum cum%
1.80hrs 98.66% 98.66% 1.81hrs 99.40% [evm-specification-miner]
0 0% 98.66% 1.53hrs 83.93% net.IP.matchAddrFamily
0 0% 98.66% 1.53hrs 83.92% net.(*UDPConn).WriteToUDP
0 0% 98.66% 1.53hrs 83.90% net.sockaddrToUDP
0 0% 98.66% 1.53hrs 83.89% net.(*UDPConn).readMsg
As you can see, most of the time is spent in evm-specification-miner (which is the name of my Go application), but I've been unable to obtain more details or even understand what the square brackets mean (there is a question with a similar problem, but it didn't receive any answer).
Here are the build and pprof commands:
go install evm-specification-miner
go tool pprof evm-specification-miner cpuprof
I've even tried the debug flags -gcflags "-N -l" (as noted here: https://golang.org/doc/gdb#Introduction), to no avail.
The profiling is done with calls to pprof.StartCPUProfile() and pprof.StopCPUProfile() as is explained by this blog post: https://blog.golang.org/profiling-go-programs:
func StartProfiling(cpuprof string) error {
    f, err := os.Create(cpuprof)
    if err != nil {
        return err
    }
    return pprof.StartCPUProfile(f)
}

func StopProfiling() error {
    pprof.StopCPUProfile()
    return nil
}
StartProfiling is called at the beginning of main(), and StopProfiling when a signal (interrupt or kill) is received (or if the program terminates normally). This profile was obtained after an interruption.
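For reference, a minimal sketch of how such helpers are typically wired up; the signal handling and the "cpuprof" file name here are assumptions chosen to match the description above, not the actual miner code:

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    if err := StartProfiling("cpuprof"); err != nil {
        log.Fatal(err)
    }
    // Stop on normal termination.
    defer StopProfiling()

    // Also stop and flush the profile when interrupted or terminated.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)
    go func() {
        <-sigs
        StopProfiling()
        os.Exit(1)
    }()

    // ... run the actual work ...
}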
Looks like updating to Go 1.9rc1 fixed it.
I no longer have [evm-specification-miner] in the profile (for the record, the top functions do not even come from my own package, so it is even weirder that they did not appear before).

Why is locking in Go much slower than in Java? Lots of time spent in Mutex.Lock() and Mutex.Unlock()

I've written a small Go library (go-patan) that collects a running min/max/avg/stddev of certain variables. I compared it to an equivalent Java implementation (patan), and to my surprise the Java implementation is much faster. I would like to understand why.
The library basically consists of a simple data store with a lock that serializes reads and writes. This is a snippet of the code:
type Store struct {
    durations map[string]*Distribution
    counters  map[string]int64
    samples   map[string]*Distribution
    lock      *sync.Mutex
}

func (store *Store) addSample(key string, value int64) {
    store.addToStore(store.samples, key, value)
}

func (store *Store) addDuration(key string, value int64) {
    store.addToStore(store.durations, key, value)
}

func (store *Store) addToCounter(key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    store.counters[key] = store.counters[key] + value
}

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    distribution, exists := destination[key]
    if !exists {
        distribution = NewDistribution()
        destination[key] = distribution
    }
    distribution.addSample(value)
}
I've benchmarked the Go and Java implementations (go-benchmark-gist, java-benchmark-gist) and Java wins by far, but I don't understand why:
Go Results:
10 threads with 20000 items took 133 millis
100 threads with 20000 items took 1809 millis
1000 threads with 20000 items took 17576 millis
10 threads with 200000 items took 1228 millis
100 threads with 200000 items took 17900 millis
Java Results:
10 threads with 20000 items takes 89 millis
100 threads with 20000 items takes 265 millis
1000 threads with 20000 items takes 2888 millis
10 threads with 200000 items takes 311 millis
100 threads with 200000 items takes 3067 millis
I've profiled the program with Go's pprof and generated a call graph. It shows that the program basically spends all its time in sync.(*Mutex).Lock() and sync.(*Mutex).Unlock().
The Top20 calls according to the profiler:
(pprof) top20
59110ms of 73890ms total (80.00%)
Dropped 22 nodes (cum <= 369.45ms)
Showing top 20 nodes out of 65 (cum >= 50220ms)
flat flat% sum% cum cum%
8900ms 12.04% 12.04% 8900ms 12.04% runtime.futex
7270ms 9.84% 21.88% 7270ms 9.84% runtime/internal/atomic.Xchg
7020ms 9.50% 31.38% 7020ms 9.50% runtime.procyield
4560ms 6.17% 37.56% 4560ms 6.17% sync/atomic.CompareAndSwapUint32
4400ms 5.95% 43.51% 4400ms 5.95% runtime/internal/atomic.Xadd
4210ms 5.70% 49.21% 22040ms 29.83% runtime.lock
3650ms 4.94% 54.15% 3650ms 4.94% runtime/internal/atomic.Cas
3260ms 4.41% 58.56% 3260ms 4.41% runtime/internal/atomic.Load
2220ms 3.00% 61.56% 22810ms 30.87% sync.(*Mutex).Lock
1870ms 2.53% 64.10% 1870ms 2.53% runtime.osyield
1540ms 2.08% 66.18% 16740ms 22.66% runtime.findrunnable
1430ms 1.94% 68.11% 1430ms 1.94% runtime.freedefer
1400ms 1.89% 70.01% 1400ms 1.89% sync/atomic.AddUint32
1250ms 1.69% 71.70% 1250ms 1.69% github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
1240ms 1.68% 73.38% 3140ms 4.25% runtime.deferreturn
1070ms 1.45% 74.83% 6520ms 8.82% runtime.systemstack
1010ms 1.37% 76.19% 1010ms 1.37% runtime.newdefer
1000ms 1.35% 77.55% 1000ms 1.35% runtime.mapaccess1_faststr
950ms 1.29% 78.83% 15660ms 21.19% runtime.semacquire
860ms 1.16% 80.00% 50220ms 67.97% main.Benchmrk.func1
Can someone explain why locking in Go seems to be so much slower than in Java? What am I doing wrong? I've also written a channel-based implementation in Go, but that is even slower.
It is best to avoid defer in tiny functions that need high performance since it is expensive. In most other cases, there is no need to avoid it since the cost of defer is outweighed by the code around it.
I'd also recommend declaring the field as lock sync.Mutex (a value) instead of a pointer. The pointer creates a slight amount of extra work for the programmer (initialisation, nil bugs) and a slight amount of extra work for the garbage collector.
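For illustration, here is a sketch of both suggestions applied to the store from the question (same package assumed; Distribution and NewDistribution are the question's own types): the mutex becomes a value field, and the hot path unlocks explicitly instead of via defer.

type Store struct {
    durations map[string]*Distribution
    counters  map[string]int64
    samples   map[string]*Distribution
    lock      sync.Mutex // value instead of *sync.Mutex: the zero value is ready to use
}

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    distribution, exists := destination[key]
    if !exists {
        distribution = NewDistribution()
        destination[key] = distribution
    }
    distribution.addSample(value)
    store.lock.Unlock() // explicit unlock avoids the defer overhead on this hot path
}

Dropping defer trades a little safety for speed (a panic between Lock and Unlock would leave the mutex held), and the cost of defer has come down considerably in more recent Go releases, so it is worth re-measuring before committing to this.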
I've also posted this question on the golang-nuts group. The reply from Jesper Louis Andersen explains quite well that Java uses synchronization optimization techniques such as lock escape analysis/lock elision and lock coarsening.
The Java JIT might be taking the lock once and allowing multiple updates within it to increase performance. I ran the Java benchmark with -Djava.compiler=NONE, which degraded its performance dramatically, but that is not a fair comparison.
I assume that many of these optimization techniques have less impact in a production environment.

java8 -XX:+UseCompressedOops -XX:ObjectAlignmentInBytes=16

So, I'm trying to run some simple code on JDK 8, printing the output via JOL:
System.out.println(VMSupport.vmDetails());
Integer i = new Integer(23);
System.out.println(ClassLayout.parseInstance(i)
.toPrintable());
The first attempt is to run it with compressed oops and compressed class pointers both disabled on a 64-bit JVM:
-XX:-UseCompressedOops -XX:-UseCompressedClassPointers
The output, pretty much expected, is:
Running 64-bit HotSpot VM.
Objects are 8 bytes aligned.
java.lang.Integer object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
4 4 (object header) 00 00 00 00 (00000000 00000000 00000000 00000000) (0)
8 4 (object header) 48 33 36 97 (01001000 00110011 00110110 10010111) (-1758055608)
12 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
16 4 int Integer.value 23
20 4 (loss due to the next object alignment)
Instance size: 24 bytes (reported by Instrumentation API)
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
That makes sense: 8 bytes klass word + 8 bytes mark word + 4 bytes for the actual value + 4 bytes of padding (to align on 8 bytes) = 24 bytes.
The second attempt is to run it with compressed oops and compressed class pointers enabled on a 64-bit JVM.
Again, the output is pretty much understandable:
Running 64-bit HotSpot VM.
Using compressed oop with 3-bit shift.
Using compressed klass with 3-bit shift.
Objects are 8 bytes aligned.
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1)
4 4 (object header) 00 00 00 00 (00000000 00000000 00000000 00000000) (0)
8 4 (object header) f9 33 01 f8 (11111001 00110011 00000001 11111000) (-134138887)
12 4 int Dummy.i 42
Instance size: 16 bytes (reported by Instrumentation API).
4 bytes compressed oop (klass word) + 8 bytes mark word + 4 bytes for the value + no space loss = 16 bytes.
The thing that does NOT make sense to me is this use-case:
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers -XX:ObjectAlignmentInBytes=16
The output is this:
Running 64-bit HotSpot VM.
Using compressed oop with 4-bit shift.
Using compressed klass with 0x0000001000000000 base address and 0-bit shift.
I was really expecting both to be "4-bit shift". Why are they not?
EDIT
The second example is run with:
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers
And the third one with:
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers -XX:ObjectAlignmentInBytes=16
Answers to these questions are mostly easy to figure out when looking into OpenJDK code.
For example, grep for "UseCompressedClassPointers", this will get you to arguments.cpp:
// Check the CompressedClassSpaceSize to make sure we use compressed klass ptrs.
if (UseCompressedClassPointers) {
    if (CompressedClassSpaceSize > KlassEncodingMetaspaceMax) {
        warning("CompressedClassSpaceSize is too large for UseCompressedClassPointers");
        FLAG_SET_DEFAULT(UseCompressedClassPointers, false);
    }
}
Okay, interesting, there is "CompressedClassSpaceSize"? Grep for its definition, it's in globals.hpp:
product(size_t, CompressedClassSpaceSize, 1*G, \
"Maximum size of class area in Metaspace when compressed " \
"class pointers are used") \
range(1*M, 3*G) \
Aha, so the class area is in Metaspace, and it takes somewhere between 1 Mb and 3 Gb of space. Let's grep for "CompressedClassSpaceSize" usages, because that will take us to actual code that handles it, say in metaspace.cpp:
// For UseCompressedClassPointers the class space is reserved above
// the top of the Java heap. The argument passed in is at the base of
// the compressed space.
void Metaspace::initialize_class_space(ReservedSpace rs) {
So, compressed classes are allocated in a smaller class space outside the Java heap, which does not require shifting -- even 3 gigabytes is small enough to use only the lowest 32 bits.
I will try to expand a little on the answer provided by Alexey, as some things might not be obvious.
Following Alexey suggestion, if we search the source code of OpenJDK for where compressed klass bit shift value is assigned, we will find the following code in metaspace.cpp:
void Metaspace::set_narrow_klass_base_and_shift(address metaspace_base, address cds_base) {
    // some code removed
    if ((uint64_t)(higher_address - lower_base) <= UnscaledClassSpaceMax) {
        Universe::set_narrow_klass_shift(0);
    } else {
        assert(!UseSharedSpaces, "Cannot shift with UseSharedSpaces");
        Universe::set_narrow_klass_shift(LogKlassAlignmentInBytes);
    }
As we can see, the class shift can be either 0 (basically no shifting) or 3 bits, because LogKlassAlignmentInBytes is a constant defined in globalDefinitions.hpp:
const int LogKlassAlignmentInBytes = 3;
So, the answer to your question:
I was really expecting both to be "4-bit shift". Why are they not?
is that ObjectAlignmentInBytes has no effect on the alignment of compressed class pointers in the Metaspace, which is always 8 bytes.
Of course this conclusion does not answer the question:
"Why when using -XX:ObjectAlignmentInBytes=16 with -XX:+UseCompressedClassPointers the narrow klass shift becomes zero? Also, without shifting how can the JVM reference the class space with 32-bit references, if the heap is 4GBytes or more?"
We already know that the class space is allocated on top of the Java heap and can be up to 3 GBytes in size. With that in mind, let's run a few tests. -XX:+UseCompressedOops and -XX:+UseCompressedClassPointers are enabled by default, so they are omitted below for conciseness.
Test 1: Defaults - 8 Bytes aligned
$ java -XX:ObjectAlignmentInBytes=8 -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x00000006c0000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000000000000000, Narrow klass shift: 3
Compressed class space size: 1073741824 Address: 0x00000007c0000000 Req Addr: 0x00000007c0000000
Notice that the heap starts at address 0x00000006c0000000 in the virtual address space and has a size of 4 GBytes. If we jump forward by 4 GBytes from where the heap starts, we land exactly where the class space begins.
0x00000006c0000000 + 0x0000000100000000 = 0x00000007c0000000
The class space size is 1 GByte, so let's jump forward by another 1 GByte:
0x00000007c0000000 + 0x0000000040000000 = 0x0000000800000000
and we land exactly at the 32 GByte boundary. With a 3-bit shift, a 32-bit reference can cover 4 G * 8 = 32 GBytes, so the JVM is able to reference the entire class space, although it is right at the limit (intentionally).
Test 2: 16 bytes aligned
java -XX:ObjectAlignmentInBytes=16 -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x0000000f00000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000001000000000, Narrow klass shift: 0
Compressed class space size: 1073741824 Address: 0x0000001000000000 Req Addr: 0x0000001000000000
This time we can observe that the heap address is different, but let's try the same steps:
0x0000000f00000000 + 0x0000000100000000 = 0x0000001000000000
This time the heap ends exactly at the 64 GByte boundary of the virtual address space and the class space is allocated above it. Since the class space can use at most a 3-bit shift, how can the JVM reference a class space located above 64 GBytes? The key is:
Narrow klass base: 0x0000001000000000
The JVM still uses 32-bit compressed pointers for the class space, but when encoding and decoding them it always adds the 0x0000001000000000 base to the compressed reference instead of shifting. Note that this approach works as long as the referenced chunk of memory is smaller than 4 GBytes (the limit of a 32-bit offset). Considering that the class space can be at most 3 GBytes, we are comfortably within the limit.
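The encoding/decoding arithmetic being described boils down to base + (narrow << shift). A tiny sketch of the two configurations seen above (illustration only, not JVM source):

package main

import "fmt"

// decode widens a 32-bit narrow class pointer into a full 64-bit address.
func decode(narrow uint32, base uint64, shift uint) uint64 {
    return base + uint64(narrow)<<shift
}

func main() {
    // 8-byte alignment case: zero base, 3-bit shift.
    // The largest narrow value reaches just under 4 G * 8 = 32 GBytes.
    fmt.Printf("%#x\n", decode(0xffffffff, 0, 3))

    // 16-byte alignment case: non-zero base, 0-bit shift.
    // 32 bits still cover 4 GBytes starting at the base, which is enough
    // for a class space of at most 3 GBytes.
    fmt.Printf("%#x\n", decode(0xffffffff, 0x0000001000000000, 0))
}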
Test 3: 16 bytes aligned, heap base pinned at 8g
$ java -XX:ObjectAlignmentInBytes=16 -XX:HeapBaseMinAddress=8g -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version
heap address: 0x0000000200000000, size: 4096 MB, zero based Compressed Oops
Narrow klass base: 0x0000000000000000, Narrow klass shift: 3
Compressed class space size: 1073741824 Address: 0x0000000300000000 Req Addr: 0x0000000300000000
In this test we keep -XX:ObjectAlignmentInBytes=16, but also ask the JVM to allocate the heap at the 8 GByte mark of the virtual address space using the -XX:HeapBaseMinAddress=8g argument. The class space then begins at the 12 GByte mark, and a 3-bit shift is more than enough to reference it.
Hopefully, these tests and their results answer the question:
"Why when using -XX:ObjectAlignmentInBytes=16 with -XX:+UseCompressedClassPointers the narrow klass shift becomes zero? Also, without shifting how can the JVM reference the class space with 32-bit references, if the heap is 4GBytes or more?"

Map access bottleneck in Golang

I am using Go to implement naive Bayesian classification for a dataset with over 30000 possible tags. I have built the model and I am in the classification phase. I am working on classifying 1000 records and this is taking up to 5 minutes. I have profiled the code with pprof; the top 10 entries are shown below:
Total: 28896 samples
16408 56.8% 56.8% 24129 83.5% runtime.mapaccess1_faststr
4977 17.2% 74.0% 4977 17.2% runtime.aeshashbody
2552 8.8% 82.8% 2552 8.8% runtime.memeqbody
1468 5.1% 87.9% 28112 97.3% main.(*Classifier).calcProbs
861 3.0% 90.9% 861 3.0% math.Log
435 1.5% 92.4% 435 1.5% runtime.markspan
267 0.9% 93.3% 302 1.0% MHeap_AllocLocked
187 0.6% 94.0% 187 0.6% runtime.aeshashstr
183 0.6% 94.6% 1137 3.9% runtime.mallocgc
127 0.4% 95.0% 988 3.4% math.log10
Surprisingly, the map access seems to be the bottleneck. Has anyone experienced this? What other key/value data structure could be used to avoid this bottleneck? All the map access happens in the piece of code below:
func (nb *Classifier) calcProbs(data string) *BoundedPriorityQueue {
    probs := &BoundedPriorityQueue{}
    heap.Init(probs)
    terms := strings.Split(data, " ")
    for class, prob := range nb.classProb {
        condProb := prob
        clsProbs := nb.model[class]
        for _, term := range terms {
            termProb := clsProbs[term]
            if termProb != 0 {
                condProb += math.Log10(termProb)
            } else {
                condProb += -6 // math.Log10(0.000001)
            }
        }
        entry := &Item{
            value:    class,
            priority: condProb,
        }
        heap.Push(probs, entry)
    }
    return probs
}
The maps are nb.classProb, which is a map[string]float64, while nb.model is a nested map of type
map[string]map[string]float64
In addition to what #tomwilde said, another approach that may speed up your algorithm is string interning. Namely, you can avoid using a map entirely if you know the domain of keys ahead of time. I wrote a small package that will do string interning for you.
Yes, the map access will be the bottleneck in this code: it's the most significant operation inside the two nested loops.
It's not possible to tell for sure from the code you've included, but I expect you've got a limited number of classes. What you might do is number them, and store the term-wise class probabilities like this:
map[string][NumClasses]float64
(i.e. for each term, store an array of class-wise probabilities [or perhaps their logs, already precomputed], where NumClasses is the number of different classes you have).
Then, iterate over terms first, and classes inside. The expensive map lookup will be done in the outer loop, and the inner loop will be iteration over an array.
This'll reduce the number of map lookups by a factor of NumClasses. This may need more memory if your data is extremely sparse.
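Here is a sketch of that restructuring, using hypothetical names (the real NumClasses, field names and scoring details would come from your own model): class probabilities are stored per term in an array indexed by a class id, and the loops are swapped so the expensive map lookup happens only once per term.

const NumClasses = 64 // assumption: replace with the real number of classes

type Classifier struct {
    classProb [NumClasses]float64            // prior log-probability per class id
    termProb  map[string][NumClasses]float64 // per-term array of class-wise log-probabilities
}

func (nb *Classifier) calcScores(terms []string) [NumClasses]float64 {
    scores := nb.classProb // array copy: one running score per class
    for _, term := range terms {
        probs, ok := nb.termProb[term] // a single map lookup per term
        for class := range scores {
            if ok && probs[class] != 0 {
                scores[class] += probs[class]
            } else {
                scores[class] += -6 // math.Log10(0.000001), as in the question
            }
        }
    }
    return scores
}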
The next optimisation is to use multiple goroutines to do the calculations, assuming you've more than one CPU core available.
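And a minimal sketch of that parallelisation (the worker count and job channel are illustrative; calcProbs is the method from the question): split the records across a fixed pool of goroutines, one per CPU core. Since classification only reads the model maps, concurrent reads are safe as long as nothing writes to the maps while this is running.

func classifyAll(nb *Classifier, records []string) {
    var wg sync.WaitGroup
    jobs := make(chan string)

    for w := 0; w < runtime.NumCPU(); w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for rec := range jobs {
                _ = nb.calcProbs(rec) // each worker classifies its share of the records
            }
        }()
    }

    for _, rec := range records {
        jobs <- rec
    }
    close(jobs)
    wg.Wait()
}

(This assumes the sync and runtime packages are imported.)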
