How can a compiler cross-compile to a different OS and architecture? - go

I'm very intrigued by the fact that Go (since v1.5) has built-in cross-compilation support.
But how is it possible to compile for a different OS and architecture?
I mean that would require knowing (and probably behaving like) the target machine language and platform.

Yes, the Go compiler has to know how the target operating system works, but it doesn't need to behave like the target OS: the Go compiler never runs the compiled executable binary, it only has to produce it.
All the Go tools need to know is the executable format of each target operating system, plus the OS and architecture details (instruction set, word size, endianness, alignment, available registers, and so on), and that knowledge is built into the Go tools.
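For example (a minimal sketch; the go tool picks the target from the GOOS and GOARCH environment variables), this program reports the platform it was compiled for:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // GOOS and GOARCH are fixed at compile time, so this prints the
        // target the binary was built for, not the machine that ran the
        // compiler.
        fmt.Println(runtime.GOOS + "/" + runtime.GOARCH)
    }

Building it with GOOS=windows GOARCH=amd64 go build on a Linux host produces a Windows PE executable, and with GOOS=linux GOARCH=arm64 go build an ELF binary for 64-bit ARM; the compiler never runs either result, it only has to know how to write them out.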

Related

Does the compiler actually produce Machine Code?

I've been reading that in most cases (like gcc) the compiler reads source code in a high-level language and spits out the corresponding machine code. Now, machine code is by definition the code that a processor can understand directly. So machine code should depend only on the machine (processor) and be OS independent. But this is not the case: even if two different operating systems run on the same processor, I cannot run the same compiled file (.exe for Windows or .out for Linux) on both operating systems.
So, what am I missing? Is the output of gcc (and most compilers) not machine code? Or is machine code not the lowest level of code, with the OS translating it further into a set of instructions the processor can execute?
You are confusing a few things. A retargettable compiler like gcc, and other generic compilers, compiles files to objects, then the linker later links the objects with other libraries as needed to make a so-called binary that the operating system can then read, parse, load the loadable blocks of, and start executing.
A sane compiler author will have the compiler emit assembly language; the compiler driver, or the user in their makefile, then calls the assembler, which creates the object. This is how gcc works, and roughly how clang works too, although llc can now produce objects directly, not just assembly that then gets assembled.
It makes far more sense to generate debuggable assembly language than to produce raw machine code directly. You really need a good reason, like a JIT, to skip that step. I would avoid toolchains that go straight to machine code just because they can; they are harder to maintain and more likely to have bugs, or to take longer to get bugs fixed.
If the architecture is the same, there is no reason why you can't have a generic toolchain generate code for incompatible operating systems; the GNU tools, for example, can do this. Operating system differences are not, by definition, at the machine-code level; most are at the high-level-language level. The C libraries you call to create GUI windows and so on have nothing to do with the machine code or the processor architecture, and for some operating systems the same OS-specific C code can be used on MIPS or ARM or PowerPC or x86. Where the architecture does become specific is in the mechanism by which actual system calls are invoked: a specific instruction is often used, and machine code is eventually involved, yes, but there is no reason this can't be written in real or inline assembly.
And this leads to libraries: even fopen and printf, which are generic C calls, eventually have to make a system call. So while much of the library support code can be written in a high-level language that is portable across systems, there has to be a small system- and architecture-specific bit of code for the last mile. You can see this in the glibc sources, or in the hooks into newlib in other library solutions, for example.
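To make that last mile concrete, here is a minimal sketch (in Go rather than C, but the mechanism is the same) that prints a string by invoking the kernel's write system call directly. It assumes linux/amd64, where write is system call number 1, and it won't even compile for Windows, because that interface simply doesn't exist there:

    package main

    import (
        "syscall"
        "unsafe"
    )

    func main() {
        msg := []byte("Hello, World\n")
        // On linux/amd64 the kernel is entered with the SYSCALL instruction
        // and syscall.SYS_WRITE == 1; file descriptor 1 is standard output.
        // syscall.Syscall hides that one architecture-specific instruction,
        // the same kind of last-mile stub that glibc hides for C programs.
        syscall.Syscall(syscall.SYS_WRITE,
            uintptr(1),                       // fd: stdout
            uintptr(unsafe.Pointer(&msg[0])), // pointer to the bytes
            uintptr(len(msg)))                // number of bytes
    }

Everything above that call is ordinary portable code; only the system-call number and the instruction used to trap into the kernel are OS- and architecture-specific.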
The same is true for other languages like C++ as it is for C. Interpreted languages have additional layers, but their virtual machines are just programs sitting on the same kinds of layers.
Low-level programming doesn't mean machine or assembly language; it just means that whatever programming language you are using accesses things at a lower level, below the application or below the operating system, and so on.
Compilers produce assembly code, which is a human-readable version of machine code (e.g., instead of 1s and 0s you have actual mnemonics). However, the correct assembly/machine code needed to make your program run correctly differs depending on the operating system. So the language the processor uses is the same, but your program needs to talk to the operating system, which is different.
For example, say you're writing a Hello World program. You need to print the phrase "Hello, World" on the screen. Your program will need to go through the OS to actually do that, and different OSes have different interfaces.
I'm deliberately avoiding technical terms here to keep the answer understandable for beginners. To be more precise, your program needs to go through the operating system to interact with the other hardware on your computer (e.g., the keyboard or display). This is done through system calls, which are different for each family of OS.
The machine code that is generated can run on any processor of the same type it was generated for. The challenge is that your code will interact with other modules or programs on the system, and to do that you need conventions for calling and returning. The generated code assumes a runtime environment (the OS) as well as library support (calling conventions), and those are not consistent across operating systems.
So things break when the program has to transition to, and depend on, other modules that use the conventions defined by the operating system.
Even if the machine code instructions were identical for the compiled program on two different operating systems (not at all likely, since different operating systems provide different services in different ways), the machine code still has to be stored in a format that the host OS can load into a process for execution, and those formats frequently differ between operating systems.

Can I run a C program compiled on one ARM processor on a different one?

Let's say I compiled a C program on a Raspberry Pi; can I run that binary on, say, a Cubietruck?
How can I know for sure that two ARM processors are compatible? Are they all compatible with each other?
There should be a simple answer in terms of the instruction sets the processors support, but I can't find any good material on that.
There are several conditions for that:
Your executable should use the "least common denominator" of all the ARM microarchitectures you wish to support. See gcc's -march=... option for that. Assuming you're running Linux, grep '^model' /proc/cpuinfo should give you that information for each platform (there is a small sketch of this check after the list).
(related) Some features may not be supported by all your target ARM cores (FPU, NEON, etc...), so be very careful with that.
You should, of course, run the same OS on all supported platforms.
You need to make sure that all supported platforms use the same ABI; ARM has a history of ABI changes, so you must take this into consideration.
If you're lucky enough to target only reasonably modern ARM platforms, you should be able to find some common ground (EABI or Hard Float ABI). Otherwise you probably have no choice but to maintain several versions of your executable.
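For what it's worth, here is a small sketch of the /proc/cpuinfo check mentioned in the first condition (assuming Linux, like the grep one-liner); it just prints the lines that identify the core and, on ARM kernels, its optional features:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("/proc/cpuinfo")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        defer f.Close()

        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            // "model name" identifies the core; the "Features" line lists
            // optional extensions (vfp, neon, ...) that matter when picking
            // a least-common-denominator -march/-mfpu setting.
            if strings.HasPrefix(line, "model") || strings.HasPrefix(line, "Features") {
                fmt.Println(line)
            }
        }
    }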

Cross-compile on a Linux host for various targets

I have a set of more or less portable C/C++ sources sitting on a Linux development host that I would like to be able to:
compile for 32- and 64-bit Linux targets
cross-compile for 32- and 64-bit Windows targets
cross-compile for 32- and 64-bit Mac targets
and, ideally, without any runtime dependencies on emulation DLLs like cygwin1.dll, MinGW runtimes, etc., though I could use them if there's no other choice. If I have to use them, I'd prefer to statically link their functionality into my code.
The target binary that is desired is:
a shared library (.so) for Linux and Mac targets, and
a DLL for Windows.
I have no idea how to build a cross-compiler (and the associated toolchain) from scratch. I hear that pre-built cross-compiler toolchains are available for various host-and-target combinations, but I don't know where to find them, or even how to use them without running into runtime crashes/coredumps later due to pointer-model subtleties (LP64, LLP64, etc.), wrong or inadequate compiler switches, or other misconfiguration.
So far I have been unable to find relevant and complete information on the above, and what little I have managed to find is scattered across so many bits and pieces that I'm not even sure everything I've read is complete or even correct (i.e. applies fully, no more and no less, to my case).
I'm not a compiler expert, just a regular user. I would appreciate information on achieving the above compilation goals.
I would like to cross-compile a library for Mac OS X on Linux, and I am considering imcross. The instructions on the site are simple, but every time you set up a cross-compiling environment you have to fix a lot of things, so I don't expect it to be straightforward. You can see on the website that there are some limitations to this project, but it is the best I have come across.
Since this is not a priority for me right now (I have other things to do before tackling this task), I haven't set up the cross environment yet. I am going to do that in a few days' time.

Distro provided cross compiler vs custom built gcc

I intend to cross compile for Raspberry Pi, basically a small ARM computer. The host will be an i686 box running Arch Linux.
My first instinct is to use the cross compiler provided by Arch Linux, arm-elf-gcc-base and arm-elf-binutils. However, every wiki and post I read seems to use some custom gcc build, and people seem to spend significant time cooking their own gcc. The problem is that they never say WHY it is important to use their gcc over another.
Can stock, distro-provided cross compilers be used for building kernels and apps for the Raspberry Pi, or for ARM in general?
Is it necessary to have multiple compilers for the ARM architecture? If so, why, given that a single gcc can support all x86 variants?
If 2), then how can I deduce what target subset is supported by a particular version of gcc?
More general question, what general use cases call for custom gcc build?
Please be as technical as you can, I'd like to know WHY as well as how.
When developers talk about building software (cross compiling) for a different machine (target) compared to their own (host) they use the term toolchain to describe the set of tools necessary to build binary files. That's because when you need to build an executable binary, you need more than a compiler.
You need routines (crt0.o) to initialize the runtime according to the requirements of the operating system and the standard libraries. You need a standard set of libraries, and those libraries need to be aware of the kernel on the target because of the system call API and several OS-level configuration details (e.g. page size) and data structures (e.g. time structures).
On the hardware side, there are different ARM architecture versions. Architectures can be backward compatible, but a toolchain by its nature produces binaries targeted at a specific architecture. You could target the most widespread architecture by default, but that won't be very fruitful in an already constrained environment (an embedded device); and if you target the latest architecture, the result won't be usable on targets based on older architectures.
When you build a binary on your host for your host, the compiler can look up all the necessary bits from its own environment or use what's on the host, so most of the above details are invisible to the developer. However, when you build for a target different from your host, the toolchain must know about the hardware, OS, and standard library details. The way you tell the toolchain about these is... by building it according to those details, which might require some level of bootstrapping (or by passing an extensive set of parameters, if the toolchain supports that and was built for it).
So when there is a generic (stock) cross-compile toolchain, it already has some target specifics baked in, and those might not meet your requirements. See this recent question about the situation on Ubuntu for an example.

Why are "Executable files" operating system dependent?

I understand that each CPU/architecture has its own instruction set, and therefore a program (binary) written for a specific CPU cannot run on another. But what I don't really understand is why an executable file (a binary such as a .exe) cannot run on Linux but can run on Windows, even on the very same machine.
This is a basic question, and the answer I'm expecting is that .exe and other binary formats are probably not raw machine instructions but contain some data that is operating-system dependent. If that is true, what does this OS-dependent data look like? And, as an example, what is the format of a .exe file and how does it differ from Linux executables?
Is there a source where I can get brief yet detailed information about this?
In order to do something meaningful, applications will need to interface with the OS. Since system calls and user-space infrastructure look fundamentally different on Windows and Unix/Linux, having different formats for executable programs is the smallest trouble. It's the program logic that would need to be changed.
(You might argue that this is meaningless if you have a program that solely depends on standardized components, for example the C runtime library. This is theoretically true - but irrelevant for most applications since they are forced to use OS-dependent stuff).
The other differences between Windows PE (EXE, DLL, ...) files and Linux ELF binaries are related to the different image loaders and some design characteristics of the two OSs. For example, on Linux a separate program is used to resolve external library imports, while this functionality is built in on Windows. Another example: Linux shared libraries work differently from DLLs on Windows. Not to mention that both formats are optimized to let the respective OS kernels load programs as quickly as possible.
Emulators like Wine try to fill the gap (and actually prove that the biggest problem is not the binary format but rather the OS interface!).
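As a rough illustration of the container formats themselves, the sketch below (written in Go simply because its standard library happens to ship parsers for all three formats) asks the debug/elf, debug/pe and debug/macho packages in turn which one recognizes a given file:

    package main

    import (
        "debug/elf"
        "debug/macho"
        "debug/pe"
        "fmt"
        "os"
    )

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: whatformat <executable>")
            os.Exit(1)
        }
        path := os.Args[1]

        // Each parser checks its own magic number and header layout and
        // rejects everything else, much like the corresponding OS loaders.
        if f, err := elf.Open(path); err == nil {
            fmt.Println("ELF (Linux and most Unix-likes), machine:", f.Machine)
            f.Close()
        } else if f, err := pe.Open(path); err == nil {
            fmt.Println("PE (Windows), machine:", f.Machine)
            f.Close()
        } else if f, err := macho.Open(path); err == nil {
            fmt.Println("Mach-O (macOS), cpu:", f.Cpu)
            f.Close()
        } else {
            fmt.Println("not a format this sketch recognizes")
        }
    }

The magic numbers, header layout and linking conventions differ between the three, which is why the loader of one OS refuses the executables of another even on identical hardware.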
.exe and other binary formats are [definitely] not raw machine instructions, but they contain some data that is operating-system dependent.
What does this OS-dependent data look like? And, as an example, what is the format of a .exe file and how does it differ from Linux executables?
Well, I guess Google failed you utterly. .EXE formats are very well-defined by Windows documentation.
http://support.microsoft.com/kb/65122
On Linux, the loader brings an executable into memory when the file is exec'ed. You can read up on ld, on the famous old a.out format, or on executable formats in general:
http://linux.die.net/man/1/ld
http://en.wikipedia.org/wiki/A.out
http://en.wikipedia.org/wiki/Executable
Apart from the executable format, which must be recognized by the system loader (i.e. the part of the OS that brings the executable into memory), the real problem is the interface to the OS. You can think of an OS as a kind of API that provides entry points you must call to do specific things, such as writing a character to the console.
These details are usually more or less hidden from the end user, so you can write a character to the screen with the same source code in higher-level languages. But often things differ more, as with, for example, the windowing environment: not all high-level languages provide a windowing layer that abstracts over those differences.
I can't comment too much on *nix, but yes, the code part of the binary is typically happy to run in either environment; it is the OS that places certain demands on the binary. On Windows you should read up on PE headers.
The second part is simply up to the developer: many times the code will reference libraries that are OS-specific, which is why you can have both portable and non-portable C++ code before it is compiled into a binary.
A very naive answer:
Their structures are different because of different process loaders;
They use OS-dependent features like syscalls, which vary from OS to OS.
Programs need to know how to invoke operating system services. How this is done depends on the operating system: some use interrupts, some use the x86 lcall instruction, and some (notably Windows) have designated shared libraries and don't document how to invoke services directly. Old 680x0 Macs and some other 680x0 operating systems used a reserved instruction-set area and trapped the resulting "invalid CPU opcode" exception. Moreover, even when the mechanism is the same, the order and argument format of system calls differ between operating systems (and sometimes between versions of the same operating system; see stat() in the Linux kernel for an example of an interface that has changed several times).
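To make that concrete for one and the same logical operation, here is a small Go sketch; the comments summarize the mainstream cases, while the code itself only calls a portable wrapper and leaves the per-OS entry mechanism to the runtime and libraries underneath:

    // The same logical request ("write these bytes to standard output")
    // reaches the kernel differently on each system:
    //   linux/amd64: the SYSCALL instruction, write = system call number 1
    //   linux/386:   historically INT 0x80, write = system call number 4
    //   windows:     no documented, stable system-call numbers; user code
    //                is expected to go through kernel32.dll / ntdll.dll
    package main

    import "os"

    func main() {
        os.Stdout.Write([]byte("hello\n")) // dispatches to the OS-specific path
    }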
There is some ability to deal with other operating systems' conventions: FreeBSD has the "linuxulator" which handles the Linux-specific kernel interface, NetBSD similarly has emulators for the system call formats of other operating systems using the same hardware (say, Ultrix on MIPS or OSF/1 on Alpha), Linux used to have iBCS2 to handle the UnixWare/SCO Unix kernel interface, Wine provides replacement shared libraries and a binary loader for PE-style Windows executables. (I don't recall if Wine also supports OS/2-style LX .exes; it probably does handle original format .exe; and then there's .com which is a raw memory dump with a header slapped on.) Even so, there is always some format that uses different conventions, and sometimes the conventions are similar enough to require hints to the OS as to how to deal with it. (See bless on FreeBSD, for example.)

Resources