Stimulate code-inlining - go

Unlike languages such as C++, where you can explicitly mark a function inline, in Go the compiler decides on its own which functions are candidates for inlining (C++ compilers do this automatically as well, but Go gives you no explicit keyword at all). There is also a debug option to see the inlining decisions as they happen, yet very little is documented online about the exact logic the Go compiler(s) use for this.
Let's say I need to rerun some big loop over a set of data every n-th period:
func Encrypt(password []byte) ([]byte, error) {
    return bcrypt.GenerateFromPassword(password, 13)
}

for id, data := range someDataSet {
    newPassword, _ := Encrypt([]byte("generatedSomething"))
    data["password"] = newPassword
    someSaveCall(id, data)
}
Aiming, for example, for Encrypt to be inlined properly, what logic do I need to take into consideration regarding the compiler?
I know from C++ that passing by reference increases the likelihood of automatic inlining even without the explicit inline keyword, but it's not easy to understand what exactly the Go compiler does to decide whether or not to inline. Scripting languages like PHP, for example, suffer immensely here: benchmark a loop over a billion iterations and the cost of a constant addSomething($a, $b) call versus $a + $b written inline is almost ridiculous.

Until you have performance problems, you shouldn't care. Inlined or not, the code will do the same thing.
If performance does matter and inlining makes a noticeable and significant difference, then don't rely on current (or past) inlining conditions: "inline" it yourself (do not put it in a separate function).
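For the loop in the question, "inlining it yourself" simply means dropping the wrapper and calling bcrypt directly; a minimal sketch, reusing someDataSet and someSaveCall from the question:
for id, data := range someDataSet {
    // Call bcrypt directly instead of going through the Encrypt wrapper,
    // so there is no extra call regardless of what the compiler decides.
    newPassword, _ := bcrypt.GenerateFromPassword([]byte("generatedSomething"), 13)
    data["password"] = newPassword
    someSaveCall(id, data)
}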
The rules can be found in the $GOROOT/src/cmd/compile/internal/inline/inl.go file. You may control its aggressiveness with the 'l' debug flag.
// The inlining facility makes 2 passes: first caninl determines which
// functions are suitable for inlining, and for those that are it
// saves a copy of the body. Then InlineCalls walks each function body to
// expand calls to inlinable functions.
//
// The Debug.l flag controls the aggressiveness. Note that main() swaps level 0 and 1,
// making 1 the default and -l disable. Additional levels (beyond -l) may be buggy and
// are not supported.
// 0: disabled
// 1: 80-nodes leaf functions, oneliners, panic, lazy typechecking (default)
// 2: (unassigned)
// 3: (unassigned)
// 4: allow non-leaf functions
//
// At some point this may get another default and become switch-offable with -N.
//
// The -d typcheckinl flag enables early typechecking of all imported bodies,
// which is useful to flush out bugs.
//
// The Debug.m flag enables diagnostic output. a single -m is useful for verifying
// which calls get inlined or not, more is for debugging, and may go away at any point.
Also check out the blog post Dave Cheney - Five things that make Go fast (2014-06-07), which writes about inlining (it's a long post; the inlining part is about in the middle, search for the word "inline").
There is also an interesting discussion about inlining improvements (maybe Go 1.9?): cmd/compile: improve inlining cost model #17566.

Better still, don’t guess, measure!
You should trust the compiler and avoid trying to guess its inner workings as it will change from one version to the next.
There are far too many tricks the compiler, the CPU or the cache can play to be able to predict performance from source code.
What if inlining makes your code bigger to the point that it no longer fits in the instruction cache, making it much slower than the non-inlined version? Cache locality can have a much bigger impact on performance than branching.
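For this specific case, measuring is easy with a standard Go benchmark comparing the wrapper against the direct call. A sketch (a hypothetical encrypt_bench_test.go; it assumes the Encrypt wrapper from the question and golang.org/x/crypto/bcrypt):
package main

import (
    "testing"

    "golang.org/x/crypto/bcrypt"
)

func Encrypt(password []byte) ([]byte, error) {
    return bcrypt.GenerateFromPassword(password, 13)
}

func BenchmarkEncryptWrapper(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Encrypt([]byte("generatedSomething")) // via the wrapper function
    }
}

func BenchmarkEncryptDirect(b *testing.B) {
    for i := 0; i < b.N; i++ {
        bcrypt.GenerateFromPassword([]byte("generatedSomething"), 13) // "hand-inlined" call
    }
}
Run it with go test -bench=. (add -gcflags='-m' to see the compiler's inlining report). With a bcrypt cost of 13, the hashing itself dwarfs any call overhead, which is exactly the point made above.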

Related

how to understand the relation between uintptr and struct?

I have come across code like the following:
func str2bytes(s string) []byte {
    x := (*[2]uintptr)(unsafe.Pointer(&s))
    h := [3]uintptr{x[0], x[1], x[1]}
    return *(*[]byte)(unsafe.Pointer(&h))
}
This function converts a string to []byte without copying the underlying data.
I tried to convert Num to ReverseNum the same way:
type Num struct {
    name  int8
    value int8
}

type ReverseNum struct {
    value int8
    name  int8
}

func main() {
    n := Num{100, 10}
    z := (*[2]uintptr)(unsafe.Pointer(&n))
    h := [2]uintptr{z[1], z[0]}
    fmt.Println(*(*ReverseNum)(unsafe.Pointer(&h))) // print result is {0, 0}
}
This code doesn't produce the result I want.
Can anybody tell me why?
That's too complicated.
A simpler version:
package main

import (
    "fmt"
    "unsafe"
)

type Num struct {
    name  int8
    value int8
}

type ReverseNum struct {
    value int8
    name  int8
}

func main() {
    n := Num{name: 42, value: 12}
    p := (*ReverseNum)(unsafe.Pointer(&n))
    fmt.Println(p.value, p.name)
}
outputs "42, 12".
But the real question is why on Earth would you want to go for such trickery instead of copying two freaking bytes which is done instantly on any sensible CPU Go programs run on?
Another problem with your approach is that, IIUC, nothing in the Go language specification guarantees that two types with seemingly identical fields must have identical memory layouts. I believe they do on most implementations, but I do not think they are required to.
Also consider that seemingly innocuous things, like adding an extra field (even of type struct{}!) to your data type, may do interesting things to the memory layout of variables of that type, so it may be outright dangerous to assume you may reinterpret the memory of Go variables the way you want.
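If you do go down this road anyway, it is worth asserting those layout assumptions explicitly rather than trusting them silently. A minimal sketch, reusing Num and ReverseNum from the question (this only catches the most obvious mismatches; it does not make the reinterpretation portable):
package main

import (
    "fmt"
    "unsafe"
)

type Num struct {
    name  int8
    value int8
}

type ReverseNum struct {
    value int8
    name  int8
}

func main() {
    var a Num
    var b ReverseNum
    // The sizes and field offsets must line up, otherwise reinterpreting
    // one type as the other is meaningless.
    if unsafe.Sizeof(a) != unsafe.Sizeof(b) {
        panic("size mismatch")
    }
    fmt.Println(unsafe.Offsetof(a.name), unsafe.Offsetof(a.value)) // 0 1 on typical implementations
    fmt.Println(unsafe.Offsetof(b.value), unsafe.Offsetof(b.name)) // 0 1 on typical implementations
}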
... I just want to learn about the principle behind the package unsafe.
It's an escape hatch.
All strongly-typed but compiled languages have a basic problem: the actual machines on which the compiled programs will run do not have the same typing system as the compiler.1 That is, the machine itself probably has a linear address space where bytes are assembled into machine words that are grouped into pages, and so on. The operating system may also provide access at, say, page granularity: if you need more memory, the OS will give you one page—4096 bytes, or 8192 bytes, or 65536 bytes, or whatever the page size is—of additional memory at a time.
There are many ways to attack this problem. For instance, one can write code directly in machine (or assembly) language, using the hardware's instruction set, to talk to the OS to achieve OS-level things. This code can then talk to the compiled program, acting as the go-between. If the compiled program needs to allocate a 40-byte data structure, this machine-level code can figure out how to do that within the strictures of the OS's page-size allocations.
But writing machine code is difficult and time-consuming. That's precisely why we have high-level languages and compilers in the first place. What if we had a way to, within the high-level language, violate the normal rules imposed by the language? By violating specific requirements in specific ways, carefully coordinating those ways with all other code that also violates those requirements, we can, in code we keep away from the usual application programming, write much of our memory-management, process-management, and so on in our high-level language.
In other words, we can use unsafe (or something similar in other languages) to deliberately break the type-safety provided by our high level language. When we do this—when we break the rules—we must know what all the rules are, and that our specific violations here will function correctly when combined with all the normal code that does obey the normal rules and when combined with all the special, unsafe code that breaks the rules.
This often requires help from the compiler itself. If you inspect the runtime source distributed with Go, you will find routines with annotations like go:noescape, go:noinline, go:nosplit, and go:nowritebarrier. You need to know when and why these are required if you are going to make much use of some of the escape-hatch programming.
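Of those, //go:noinline in particular can also be used in ordinary code, and it ties back to the inlining question above: it tells the compiler to keep a call as a real call. A minimal sketch (the directive must sit immediately above the function declaration, with no space after //):
package main

import "fmt"

//go:noinline
func add(a, b int) int {
    return a + b
}

func main() {
    // Without the directive this trivial function would normally be inlined;
    // with it, the call remains an actual call at runtime.
    fmt.Println(add(1, 2))
}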
A few of the simpler uses, such as tricks to gain access to string or slice headers, are ... well, they are still unsafe, but they are unsafe in more-predictable ways and do not require this kind of close coordination with the compiler itself.
To understand how, when, and why they work, you need to understand how the compiler and runtime allocate and work with strings and slices, and in some cases, how memory is laid out on the hardware, and some of the rules about Go's garbage collector. In particular, the GC code is aware of unsafe.Pointer but not of uintptr. Much of this is pretty tricky: see, e.g., https://utcc.utoronto.ca/~cks/space/blog/programming/GoUintptrVsUnsafePointer and the link to https://github.com/golang/go/issues/19135, in which writing nil to a Go pointer value caused Go's garbage collector to complain, because the write caused the GC to inspect the previously stored value, which was invalid.
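A small sketch of that unsafe.Pointer-versus-uintptr distinction (illustrative only; the valid conversion patterns are spelled out in the unsafe package documentation):
package main

import (
    "fmt"
    "unsafe"
)

func main() {
    buf := [4]byte{1, 2, 3, 4}

    // OK: the round trip through uintptr happens inside a single expression,
    // so the reference to buf is never held only as a plain integer.
    p := unsafe.Pointer(uintptr(unsafe.Pointer(&buf)) + 1)
    fmt.Println(*(*byte)(p)) // 2

    // Risky: the address is parked in a uintptr variable. The GC does not
    // treat u as a reference, which is why this pattern is invalid in general.
    u := uintptr(unsafe.Pointer(&buf))
    q := unsafe.Pointer(u + 1)
    fmt.Println(*(*byte)(q))
}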
1See this Wikipedia article on the Intel 432 for a notable attempt at designing hardware to run compiled high level languages. There have been others in the past as well, often with the same fate, though some IBM projects have been more successful.

arm-gcc mktime binary size

I need to perform simple arithmetic on struct tm from time.h. I need to add or subtract seconds or minutes, and be able to normalize the structure. Normally, I'd use mktime(3) which performs this normalization as a side effect:
struct tm t = {.tm_hour=0, .tm_min=59, .tm_sec=40};
t.tm_sec += 30;
mktime(&t);
// t.tm_hour is now 1
// t.tm_min is now 0
// t.tm_sec is now 10
I'm doing this on an STM32 with 32 kB of flash, and the binary gets very big. mktime(3) and the other things it pulls in take up 16 kB of flash, half the available space.
Is there a function in newlib that is specifically responsible for struct tm normalization? I realize that linking to a private function like that would make the code less portable.
There is a validate_structure() function in newlib/libc/time/mktime.c which does part of the job: it normalizes month, day-of-month, hour, minute and second, but leaves day-of-week and day-of-year alone.
It's declared static, so you can't simply call it, but you can copy the function from the sources (there might be licensing issues, though), or you can just reimplement it; it's quite straightforward.
tm_wday and tm_yday are calculated later in mktime(), so you'd need the whole mess, including the timezone stuff, in order to have these two normalized.
The bulk of that 16kB code is related to a call to siscanf(), a variant of sscanf() without floating point support, which is (I believe) used to parse timezone and DST information in environment variables.
You can cut lots of unnecessary code by using --specs=nano.specs when linking, which would switch to simplified printf/scanf code, saving about 10kB of code in your case.

Sharing memory with the kernel and compiler optimizations

A frame is shared with the kernel.
User-space code:
    read frame       // read the frame contents
    _mm_mfence       // make sure we don't "release" the frame before we have read everything
    frame.status = 0 // "release" the frame
Kernel code:
    poll for frame.status // reads the frame's status
    _mm_lfence
The kernel can poll it asynchronously, in another thread, so there is no syscall between the user-space and kernel-space code.
Is it correctly synchronized?
I have doubts because of the following situation:
The compiler has a weak memory model, and we have to assume it can make arbitrarily wild transformations as long as the optimized/changed program is consistent within a single thread.
So, in my eyes, we need a second barrier, because it is possible that the compiler optimizes out the store frame.status = 0.
Yes, it would be a very wild optimization, but if the compiler can prove that no one in the current context (within the thread) reads that field, it can optimize it away.
I believe that this is theoretically possible, isn't it?
So, to prevent that we can put the second barrier:
User-space code:
    read frame       // read the frame contents
    _mm_mfence       // make sure we don't "release" the frame before we have read everything
    frame.status = 0 // "release" the frame
    _mm_mfence
OK, now the compiler restrains itself from that optimization.
What do you think?
EDIT
The question is prompted by the issue that _mm_mfence does not prevent stores from being optimized out.
@PeterCordes, to make sure I understand: _mm_mfence does not prevent stores from being optimized out (it is just an x86 memory barrier, not a compiler barrier). However, atomic_thread_fence(any_order) prevents reorderings (depending on any_order, obviously), but does it also prevent stores from being optimized out?
For example:
// x is an int pointer
*x = 5;
*(x + 4) = 6;
std::atomic_thread_fence(std::memory_order_release);
Does this prevent the stores to x from being optimized out? It seems that it must, otherwise every store to x would have to be volatile.
However, I have seen a lot of lock-free code, and it does not mark fields as volatile.
_mm_mfence is also a compiler barrier. (See When should I use _mm_sfence _mm_lfence and _mm_mfence, and also BeeOnRope's answer there.)
atomic_thread_fence with release, acq_rel, or seq_cst stops earlier stores from merging with later stores. But mo_acquire doesn't have to.
Writes to non-atomic global variables can only be optimized out by merging with other writes to the same non-atomic variables, not by optimizing them away entirely. So the real question is what reorderings can happen that can let two non-atomic assignments come together.
There has to be an assignment to an atomic variable in there somewhere for there to be anything that another thread could synchronize with. Some compilers might give atomic_thread_fence stronger behaviour wrt. non-atomic variables, but in C++11 there's no way for another thread to legally observe anything about the ordering of *x and x[4] in
#include <atomic>

std::atomic<int> shared_flag {0};
int x[8];

void writer() {
    *x = 0;
    x[4] = 0;
    std::atomic_thread_fence(std::memory_order_release);
    x[4] = 1;
    std::atomic_thread_fence(std::memory_order_release);
    shared_flag.store(1, std::memory_order_relaxed);
}
The store to shared_flag has to appear after the stores to x[0] and x[4], but it's only an implementation detail what order the stores to x[0] and x[4] happen in, and whether there are 2 stores to x[4].
For example, on the Godbolt compiler explorer gcc7 and earlier merge the stores to x[4], but gcc8 doesn't, and neither do clang or ICC. The old gcc behaviour does not violate the ISO C++ standard, but I think they strengthened gcc's thread_fence because it wasn't strong enough to prevent bugs in other cases.
For example,
void writer_gcc_bug() {
    *x = 0;
    std::atomic_thread_fence(std::memory_order_release);
    shared_flag.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    *x = 2; // gcc7 and earlier merge this with the earlier *x = 0, which is arguably a bug
}
gcc only emits shared_flag = 1; *x = 2; in that order. You could argue that there's no way for another thread to safely observe *x after seeing shared_flag == 1, because this thread writes it again right away with no synchronization (i.e. data race UB in any potential observer makes this reordering arguably legal).
But gcc developers don't think that's enough reason (it may violate the guarantees of the builtin __atomic functions that the <atomic> header uses to implement the API). And there may be other cases with a real bug, where even a standards-conforming program could observe the aggressive reordering that violated the standard.
Apparently this changed on 2017-09 with the fix for gcc bug 80640.
Alexander Monakov wrote:
I think the bug is that on x86 __atomic_thread_fence(x) is expanded into nothing for x!=__ATOMIC_SEQ_CST, it should place a compiler barrier similar to expansion of __atomic_signal_fence.
(__atomic_signal_fence includes something as strong as asm("" ::: "memory" ).)
Yup that would definitely be a bug. So it's not that gcc was being really clever and doing allowed reorderings, it was just mostly failing at thread_fence, and any correctness that did happen was due to other factors, like non-inline function boundaries! (And that it doesn't optimize atomics, only non-atomics.)

Protecting memory from changing

Is there a way to protect an area of the memory?
I have this struct:
#define BUFFER 4

struct
{
    char s[BUFFER-1];
    const char zc;
} str = {'\0'};

printf("'%s', zc=%d\n", str.s, str.zc);
It is supposed to operate on strings of length BUFFER-1 and guarantee that they end in '\0'.
But the compiler gives an error only for:
str.zc='e'; /*error */
Not if:
str.s[3]='e'; /*no error */
If compiling with gcc plus some flag can accomplish this, that is good as well.
Thanks,
Beco
To detect errors at runtime, take a look at the -fstack-protector-all option in gcc. It may be of limited use when attempting to detect very small overflows like the one you described.
Unfortunately you aren't going to find a lot of info on detecting, at compile time, buffer overflow scenarios like the one you described. From a C language perspective the syntax is totally correct, and the language gives you just enough rope to hang yourself with. If you really want to protect your buffers from yourself, you can write a front end for array accesses that validates the index before it allows access to the memory you want.

D Dynamic Arrays - RAII

I admit I have no deep understanding of D at this point, my knowledge relies purely on what documentation I have read and the few examples I have tried.
In C++ you could rely on the RAII idiom to call the destructor of objects on exiting their local scope.
Can you in D?
I understand D is a garbage collected language, and that it also supports RAII.
Why does the following code not cleanup the memory as it leaves a scope then?
import std.stdio;

void main() {
    {
        const int len = 1000 * 1000 * 256; // ~1GiB
        int[] arr;
        arr.length = len;
        arr[] = 99;
    }
    while (true) {}
}
The infinite loop is there to keep the program alive so that residual memory allocations are easily visible.
A comparison with an equivalent program in C++ is shown below.
It can be seen that C++ immediately cleaned up the memory after allocation (the refresh rate makes it appear as if less memory was allocated), whereas D kept it even though it had left scope.
Therefore, when does the GC cleanup?
scope declarations are going away in D2, so I'm not terribly certain of the semantics, but what I'd imagine is happening is that scope T[] a; only allocates the array struct on the stack (which, needless to say, already happens regardless of scope). Since they are going away, don't use scope (using scope(exit) and friends is different -- keep using those).
Dynamic arrays always use the GC to allocate their memory -- there's no getting around that. If you want something more deterministic, using std.container.Array would be the simplest way, as I think you could pretty much drop it in where your scope vector3b array is:
Array!vector3b array
Just don't bother setting the length to zero -- the memory will be freed once it goes out of scope (Array uses malloc/free from libc under the hood).
No, you cannot assume that the garbage collector will collect your object at any point in time.
There is, however, a delete keyword (as well as a scope keyword) that can delete an object deterministically.
scope is used like:
{
    scope auto obj = new int[5];
    //....
} // obj cleaned up here
and delete is used like in C++ (there's no [] notation for delete).
There are some gotcha's, though:
It doesn't always work properly (I hear it doesn't work well with arrays)
The developers of D (e.g. Andrei) are intending to remove them in later versions, because it can obviously mess up things if used incorrectly. (I personally hate this, given that it's so easy to screw things up anyway, but they're sticking with removing it, and I don't think people can convince them otherwise although I'd love it if that was the case.)
In its place, there is already a clear method that you can use, like arr.clear(); however, I'm not quite sure what it exactly does yet myself, but you could look at the source code in object.d in the D runtime if you're interested.
As to your amazement: I'm glad you're amazed, but it shouldn't be really surprising considering that they're both native code. :-)

Resources