Why does CGO_ENABLED make such an impact on virtual memory? - go

I have a small daemon written in Go, which works in a loop and does some stuff. I've discovered that the daemon behaves differently depending on whether it's compiled with CGO_ENABLED=1 or CGO_ENABLED=0. For example, with CGO_ENABLED=1 (which is the default) the program's VSZ bloats up to 1-2GB within a short period of time (within an hour). With CGO_ENABLED=0, VSZ stays the same over a long period of time (days). Look at the numbers below:
CGO_ENABLED=1 (daemon has been running for 5 minutes)
$ grep -E 'VmSize|VmRSS' /proc/14916/status
VmSize: 1084052 kB
VmRSS: 12524 kB
CGO_ENABLED=0 (daemon has been running for ~30 hours)
$ grep -E 'VmSize|VmRSS' /proc/15160/status
VmSize: 110232 kB
VmRSS: 9756 kB
The daemon does not use any cgo-dependent packages or functions. Other programs written in Go show the same behaviour. I know the difference between VSZ and RSS, and I'm interested in the nature of this behaviour. Why does a program compiled with CGO_ENABLED=1 ask the kernel for so much memory?
I would prefer answers that are not in the form "don't worry, VSZ is just virtual memory, and it's not really used by the process".

I could make an educated guess.
As you probably know, the compiler of the "reference" Go implementation (historically dubbed "gc"; that one, available for download from the main site) by default produces statically-linked binaries. This means, such binaries rely only on the so-called "system calls" provided by the OS kernel and do not depend on any shared libraries provided by the OS (or 3rd parties).
On Linux-based platforms, this is not completely true: in the default setting (building on Linux for Linux, i.e., not cross-compiling) the generated binary is actually linked with libc and with libpthread (indirectly, via libc).
This "twist" comes out of the two needs the Go standard library has to interact with the OS:
DNS resolving, which is needed by the net package.
User and group lookup, which is needed by the os/user package.
The problem here is two-fold:
Linux itself (that is, the kernel, not the whole OS) does not provide any means to carry out those tasks.
Any typical UNIX-like system, since forever, provides for both those tasks using a special facility called "NSS",
which is the "Name-Service Switch"¹.
The NSS provides for pluggable modules which can serve
as the databases offering queries of particular type: DNS, user/group database, and more (such as well-known names for "services" etc). A supposedly rather common example of
a non-standard provider for the user/group databases is a local
service which contacts an LDAP server.
On a typical GNU/Linux-based OS the NSS is implemented by
libc (on less typical systems it might be provided by a
separate shared library but this does not change much).
Since (again, typically) libc is a rather stable library in terms of its API (it even provides versioned symbols to be future-proof), the Go authors rightfully decided that linking against libc to import a minimal subset of symbols (mostly getaddrinfo, getnameinfo, getpwnam_r etc.) is OK to do by default, as it's safe for 99% of cases, and when it isn't, those who have to tackle these cases usually know what to do anyway.
So, by default cgo is enabled and used to implement these lookups using NSS.
If cgo is disabled, the Go toolchain instead links in its own fallback implementations, which try to mimic a subset of what a full-blown NSS implementation does (i.e. parse /etc/resolv.conf and use the information from it to directly query the DNS servers listed there; parse /etc/passwd and /etc/group to serve the user/group database queries).
As you can see, in the default case,
libc gets mapped in, and
it is initialized and uses some memory for its own needs,
such as the obvious caching of the data the NSS calls return.
Conversely, in the case when cgo is disabled, the above two things do not happen. You have more stdlib code linked in statically, but it looks like the default case simply outweighs the latter one in terms of the overall cumulative RSS usage.
Consider studying the output of
this query
for additional fun ;-)
¹ not to be confused with Mozilla's libnss.

Related

How to specify the physical CoreIDs used for "CLOSE" when specifying OMP_PROC_BIND?

We are trying to optimize HPC applications using OpenMP on a new hardware platform. These applications need precise placement/pinning of their cores or performance falls in half. Currently, we provide the user a custom GOMP_CPU_AFFINITY map for each platform, but this is cumbersome, because it's different on each hardware version, and even platforms with different firmware versions sometimes change their CoreID physical mappings - all things impossible for the user to detect on the fly.
It would be a great help if HPC applications could simply set GOMP_PROC_BIND to "close" and OpenMP would do the right thing for the given platform - but to make this possible, the hardware vendor would need to define what "close" means for each machine. We'd like to do this, but we can't tell how/where OpenMP gets CoreID lists to use for things like close, spread, etc. (For various external requirements, the CoreID spatial pattern on this machine would appear utterly random to a software writer.)
Any advice as to where/how OpenMP defines the CoreID lists for OMP_PROC_BIND so we could configure them? We are comfortable with the idea that we might need a custom version of OpenMP (with altered source code) for this platform if needed.
Thanks, everyone. :)
Jeff
Expanding on what @VictorEijkhout said...
You seem to have confused an environment variable that I can't find anywhere with Google (GOMP_PROC_BIND) with the OpenMP standard environment variable (OMP_PROC_BIND). If GOMP_PROC_BIND exists, the name suggests that it is a GNU feature. Note too that one of the two Google hits for GOMP_PROC_BIND says "Code that reads the setting is buggy. Setting is invalid and ignored at runtime." So, if you are setting that, it is unsurprising that it has no effect!
I will therefore answer for the more general case of OMP_PROC_BIND.
The binding of OpenMP threads to logical CPUs clearly has to be done at runtime, since, beyond its ISA, the compiler has no knowledge of the hardware on which the compiled code will run. Therefore you need to be looking at the runtime library code.
I have not looked at GNU's libgomp, but, where it can, LLVM's libomp uses the hwloc library to explore the machine hardware. Since hwloc also includes other useful tools for machine exploration (such as lstopo) it is likely that your effort is best invested in ensuring good hwloc support on your machine, at which point there will be no need to delve inside the OpenMP runtime.
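Until the topology is reported correctly, the per-platform GOMP_CPU_AFFINITY maps mentioned in the question could at least be generated rather than written by hand. The sketch below is only an illustration of that idea and says nothing about how libgomp or libomp work internally. The cpu type, closeOrder and affinityList are hypothetical names, and the topology in main is made up; real values would have to come from a tool such as lstopo or from the platform vendor.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// cpu describes one logical CPU. Hypothetical type for this sketch.
type cpu struct {
	Logical int // logical CPU number as the OS reports it
	Package int // physical package (socket) ID
	Core    int // core ID within the package
}

// closeOrder returns logical CPU numbers ordered so that neighbours in the
// list are physically adjacent: by package, then core, then logical number.
func closeOrder(cpus []cpu) []int {
	sorted := make([]cpu, len(cpus))
	copy(sorted, cpus)
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Package != b.Package {
			return a.Package < b.Package
		}
		if a.Core != b.Core {
			return a.Core < b.Core
		}
		return a.Logical < b.Logical
	})
	order := make([]int, len(sorted))
	for i, c := range sorted {
		order[i] = c.Logical
	}
	return order
}

// affinityList formats CPU numbers as a space-separated list, one of the
// forms GOMP_CPU_AFFINITY accepts.
func affinityList(order []int) string {
	parts := make([]string, len(order))
	for i, n := range order {
		parts[i] = strconv.Itoa(n)
	}
	return strings.Join(parts, " ")
}

func main() {
	// Made-up topology in which logical numbering does not follow layout.
	topo := []cpu{
		{Logical: 0, Package: 0, Core: 1},
		{Logical: 1, Package: 1, Core: 0},
		{Logical: 2, Package: 0, Core: 0},
		{Logical: 3, Package: 1, Core: 1},
	}
	fmt.Printf("GOMP_CPU_AFFINITY=%q\n", affinityList(closeOrder(topo)))
}
```

For the made-up topology above, the resulting list is "2 0 1 3".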

Is Go developed enough to use it to make the core of an operating system?

I'm wondering if Go is developed enough to use it to make the core of an operating system? So basically replace what you would normally use C for with Go.
Of course you can develop an OS in almost any (Turing complete) language. Usually there's some small assembly layer required, though. And usually one must implement some parts of the OS using only a restricted subset of the language in question.
Examples:
JavaOS.
Singularity. (Applies with some limits only.)
As for Go, there used to be a usable (toy) Go kernel implementation, but it has been obsolete for a long time. From rsc's post:
In the repository history there is a toy kernel called "tiny".
If you run hg log -k tiny you'll find it. It doesn't build anymore
with the current version of Go but it illustrates what might
be done. It had the whole package runtime, including the
garbage collector, in the kernel.
Russ

high performance runtime

This is the first time I've submitted a question in this forum.
I'm posting a general question. I don't have to develop an application for a specific purpose.
After a lot of "googling" I still haven't found a language/runtime/script engine/virtual machine that matches these 5 requirements:
memory allocation of variables/values or objects cleaned at run time
(e.g. a la C++ that use keyword delete or free in C )
language (and consequently the program) is a script or
pseudo-compiled a la byte code that should be portable on main
operating system (windows, linux, *bsd, solaris) & platform(32/64bit)
native use of multicore (engine/runtime)
no limit on the heap usage
library for network
The programming language used to build applications that run on this engine can be paradigm-agnostic (the paradigm is not important).
I hope that this post won’t stir up a Holy-War but I'd like to put focus on engine behavior during program execution.
Sorry for my bad English.
Luke
I think Erlang might fit your requirement:
most data is either allocated in local scopes and therefore immediately deleted after use or contained in a library-powered permanent storage like ETS, DETS or Mnesia. There is Garbage Collection, though, but the paradigm of the language makes the need for it not as important.
the Erlang compiler compiles the source code to byte code for the BEAM virtual machine, which, unlike Java's, is register-based and thus much faster. The VM is available for:
Solaris (including 64 bit)
BSD
Linux
OSX
TRU64
Windows NT/2000/2003/XP/Vista/7
VxWorks
Erlang has been designed for distributed systems, concurrency and reliability from day one
Erlang's Heap grows with your demand for it, it's initially limited and expanded automatically (there are numerous tweaks you can use to configure this on a per-VM-basis)
Erlang comes from a networking background and provides tons of libraries from IP to higher-level protocols

Is it possible to write a libPOSIX for Windows (Win32) without requiring a background service or DLL that's always loaded?

I know about Cygwin, and I know of its shortcomings. I also know about the slowness of fork, but not why on Earth it's not possible to work around that. I also know Cygwin requires a DLL. I also understand POSIX defines a whole environment (shell, etc...), that's not really what I care about here.
My question is asking if there is another way to tackle the problem. I see more and more of POSIX functionality being implemented by the MinGW projects, but there's no complete solution providing a full-blown (comparable to Linux/Mac/BSD implementation status) POSIX functionality.
The question really boils down to:
Can the Win32 API (as of MSVC20??) be efficiently used to provide a complete POSIX layer over the Windows API?
Perhaps this will turn out to be a full libc that only taps into the OS library for low-level things like filesystem access, threads, and process control. But I don't know exactly what else POSIX consists of. I doubt a library can turn Win32 into a POSIX-compliant entity.
POSIX <> Win32.
If you're trying to write apps that target POSIX, why are you not using some variant of *N*X? If you prefer to run Windows, you can run Linux/BSD/whatever inside Hyper-V/VMWare/Parallels/VirtualBox on your PC/laptop/etc.
Windows used to have a POSIX compliant environment that ran alongside the Win32 subsystem, but was discontinued after NT4 due to lack of demand. Microsoft bought Interix and released Services For Unix (SFU). While it's still available for download, SFU 3.5 is now deprecated and no longer developed or supported.
As to why fork is so slow, you need to understand that fork isn't just "Create a new process", it's "create a new process (itself an expensive operation) which is a duplicate of the calling process along with all memory".
In *N*X, the forked process is mapped to the same memory pages as the parent (i.e. is pretty quick) and is only given new pages as and when the forked process tries to modify any shared pages. This is known as copy on write. This is largely achievable because in UNIX, there is no hard barrier between the parent and forked processes.
In NT, on the other hand, all processes are separated by a barrier enforced by CPU hardware. In NT, the easiest way to spawn a parallel activity which has access to your process' memory and resources, is to create a thread. Threads run within the memory space of the creating process and have access to all of the process' memory and resources.
You can also share data between processes via various forms of IPC, RPC, Named Pipes, mailslots, memory-mapped files but each technique has its own complexities, performance characteristics, etc. Read this for more details.
Because it tries to mimic UNIX, Cygwin's 'fork' operation creates a new child process (in its own isolated memory space) and has to duplicate every page of memory in the parent process within the newly forked child. This can be a very costly operation.
Again, if you want to write POSIX code, do so in *N*X, not NT.
How about this
Most of the Unix API is implemented by the POSIX.DLL dynamically loaded (shared) library. Programs linked with POSIX.DLL run under the Win32 subsystem instead of the POSIX subsystem, so programs can freely intermix Unix and Win32 library calls.
From http://en.wikipedia.org/wiki/UWIN
The UWIN environment may be what you're looking for, but note that it is hosted at research.att.com. While UWIN is distributed under a liberal license, it is not the GNU license. Also, as it is research for AT&T, and only secondarily something that they distribute for use, there are a lot of issues with documentation.
For more info, see my write-up as the last answer for Regarding 'for' loop in KornShell
Hmm, the main UWIN link in that post is bad; try
http://www2.research.att.com/sw/download/
Also, You can look at
https://mailman.research.att.com/pipermail/uwin-users/
OR
https://mailman.research.att.com/pipermail/uwin-developers/
To get a sense of the features vs issues.
I hope this helps.
The question really boils down to: Can the Win32 API (as of MSVC20??)
be efficiently used to provide a complete POSIX layer over the Windows
API?
Short answer: No.
"Complete POSIX" means fork(), mmap(), signal() and such, and these are [almost] impossible to implement on NT.
To drive the point home: GNU Hurd has problems with fork() as well, because the Hurd kernel is not POSIX.
NT is not POSIX either.
Another difference is persistence:
In POSIX-compliant systems it is possible to create system objects and leave them there. Examples of such objects are named pipes and shared memory objects (shms). You can create a named pipe or a shm, and leave it in the filesystem (or in a special filesystem-like place) where other processes will be able to access it. The downside is that a process might die and fail to clean up after itself, leaving unused objects behind (you know about zombie processes? same thing).
In NT every object is reference-counted, and is destroyed as soon as its last handle is closed. Files are among the few objects that persist.
Symlinks are a filesystem feature, and don't exactly depend on the NT kernel, but the current implementation (in Vista and later) is incapable of creating object-type-agnostic symlinks. That is, a symlink is either a file or a directory, and must link to either a file or a directory. If the target has the wrong type, the symlink won't work. You can give it the right type if the target exists when you create the symlink, but POSIX requires that symlinks may be created without their target existing. I can't imagine a use-case for a symlink that points first to a file, then to a directory, but POSIX says that this should work, and if it doesn't, you're not completely POSIX-compliant. Or if your symlinking API/utility can be given an option that specifies the right type when the target doesn't exist, that also breaks POSIX compatibility.
It is possible to replicate some POSIX features to some degree (such as "integer descriptors from a single namespace, referencing any I/O object, and being select()able") without sacrificing [much] performance, but that is still a major undertaking, and the POSIX interface is really restrictive (that is, if you could just add one more argument to that function, it would have been possible to Do The Right Thing...but you couldn't, unless you want to throw POSIX compliance away).
Your best bet is not to rely on POSIX features that are difficult to port to non-POSIX systems, or to abstract in such a way that lower levels may have separate implementations for different OSes, and upper levels do not care about the details.

Why are "Executable files" operating system dependent?

I understand that each CPU/architecture has its own instruction set, therefore a program (binary) written for a specific CPU cannot run on another. But what I don't really understand is why an executable file (a binary like .exe, for instance) cannot run on Linux but can run on Windows, even on the very same machine.
This is a basic question, and the answer I'm expecting is that .exe and other binary formats are probably not Raw machine instructions but they contain some data that is operating system dependent. If this is true, then what is this OS dependent data like? And, as an example, what is the format of an .exe file and the difference between it and Linux executables?
Is there a source where I can get brief and detailed information about this?
In order to do something meaningful, applications will need to interface with the OS. Since system calls and user-space infrastructure look fundamentally different on Windows and Unix/Linux, having different formats for executable programs is the smallest trouble. It's the program logic that would need to be changed.
(You might argue that this is meaningless if you have a program that solely depends on standardized components, for example the C runtime library. This is theoretically true - but irrelevant for most applications since they are forced to use OS-dependent stuff).
The other differences between Windows PE (EXE, DLL, ...) files and Linux ELF binaries are related to the different image loaders and some design characteristics of both OSs. For example, on Linux a separate program is used to resolve external library imports, while this functionality is built-in on Windows. Another example: Linux shared libraries function differently from DLLs on Windows. Not to mention that both formats are optimized to enable the respective OS kernels to load programs as quickly as possible.
Emulators like Wine try to fill the gap (and actually prove that the biggest problem is not the binary format but rather the OS interface!).
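The format difference is visible in the very first bytes of a file: ELF binaries begin with the bytes 0x7F 'E' 'L' 'F', while Windows executables begin with the 'MZ' signature of the DOS header that precedes the PE header. A small sketch that tells them apart (detectFormat is a hypothetical helper, and it looks at nothing beyond these magic bytes):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// detectFormat classifies an executable by the magic bytes at the start of
// the file. Hypothetical helper for illustration only.
func detectFormat(header []byte) string {
	switch {
	case bytes.HasPrefix(header, []byte("\x7fELF")):
		return "ELF"
	case bytes.HasPrefix(header, []byte("MZ")):
		return "MZ (DOS/PE)"
	default:
		return "unknown"
	}
}

func main() {
	// Inspect the file given as the first argument, or this program itself.
	path, err := os.Executable()
	if len(os.Args) > 1 {
		path, err = os.Args[1], nil
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	header := make([]byte, 4)
	n, err := io.ReadFull(f, header)
	if err != nil && n == 0 {
		fmt.Fprintln(os.Stderr, "cannot read header:", err)
		return
	}
	fmt.Printf("%s: %s\n", path, detectFormat(header[:n]))
}
```

For real inspection, Go's standard library has the debug/elf and debug/pe packages, which parse the full headers and section tables.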
.exe and other binary formats are [definitely] not Raw machine instructions but they contain some data that is operating system dependent.
what this OS dependent data is like? and as an example what is the format of an .exe file and the difference between it and Linux executables?
Well, I guess Google failed you utterly. .EXE formats are very well-defined by Windows documentation.
http://support.microsoft.com/kb/65122
The Linux ld application loads an executable into memory prior to "exec" to that file. You could read up on ld format or even the famous a.out file.
http://linux.die.net/man/1/ld
http://en.wikipedia.org/wiki/A.out
http://en.wikipedia.org/wiki/Executable
Apart from the executable format that must be recognized by the system loader (i.e. that part of an OS that brings the executable into memory) the real problem is the interface to the OS. You can think of an OS as a kind of API that provides entry points one must call for doing specific things, like for example, writing a character to the console.
These details are usually more or less hidden from the end user, so that you can achieve writing a character to the screen with the same source code in higher level languages. But often, things are more different, like for example the Windowing environment. Not all high level languages provide a windowing layer that abstracts even over those differences.
I can't comment too much on *nix but yes, the code part of the binary is typically happy to run on either environment, but it is the OS that places certain demands on the binary. On Windows you should read up on PE headers.
The second part is simply up to the developer. Many times the code part will reference libraries that are OS specific, which is why you can have both portable and non-portable C++ code before it is compiled into a binary.
A very naive answer:
Their structures are different because of different process loaders;
They use OS-dependent features like syscalls, which vary from OS to OS.
Programs need to know how to invoke operating system services. How this is done depends on the operating system: some use interrupts, some use the x86 lcall instruction, some (notably Windows) have distinguished shared libraries and don't document how to directly invoke services. Old 680x0 Macs and some other 680x0 operating systems used a reserved instruction set area and trapped the resulting "invalid CPU opcode" exception. Moreover, even when the mechanism is the same, the order and argument format of system calls differs between operating systems (and sometimes different versions of the same operating system; see stat() in the Linux kernel for an example of an interface that has changed several times).
There is some ability to deal with other operating systems' conventions: FreeBSD has the "linuxulator" which handles the Linux-specific kernel interface, NetBSD similarly has emulators for the system call formats of other operating systems using the same hardware (say, Ultrix on MIPS or OSF/1 on Alpha), Linux used to have iBCS2 to handle the UnixWare/SCO Unix kernel interface, Wine provides replacement shared libraries and a binary loader for PE-style Windows executables. (I don't recall if Wine also supports OS/2-style LX .exes; it probably does handle original format .exe; and then there's .com which is a raw memory dump with a header slapped on.) Even so, there is always some format that uses different conventions, and sometimes the conventions are similar enough to require hints to the OS as to how to deal with it. (See bless on FreeBSD, for example.)
