I've been reading that in most cases (like gcc) the compiler reads the source code in a high-level language and spits out the corresponding machine code. Now, machine code by definition is the code that a processor can understand directly. So machine code should be only machine (processor) dependent and OS independent. But this is not the case: even if two different operating systems are running on the same processor, I cannot run the same compiled file (.exe for Windows or .out for Linux) on both operating systems.
So, what am I missing? Is the output of gcc (and most compilers) not machine code? Or is machine code not the lowest level of code, and the OS translates it further into a set of instructions that the processor can execute?
You are confusing a few things. A retargettable compiler like gcc, and other generic compilers, compile source files to object files; then the linker links the objects with other libraries as needed to make a so-called binary, which the operating system can read, parse, load the loadable blocks from, and start executing.
A sane compiler author will use assembly language as the output of the compiler; then the compiler (or the user, in their makefile) calls the assembler, which creates the object file. This is how gcc works, and roughly how clang works too, although llc can now produce objects directly rather than only assembly that then gets assembled.
It makes far more sense to generate debuggable assembly language than to produce raw machine code directly. You really need a good reason, like a JIT, to skip that step. I would avoid toolchains that go straight to machine code just because they can; they are harder to maintain and more likely to have bugs or take longer to fix them.
If the architecture is the same, there is no reason why you can't have a generic toolchain generate code for incompatible operating systems; the GNU tools, for example, can do this. Operating system differences are not, by definition, at the machine-code level. Most are at the high-level-language level: the C libraries you call to create GUI windows, etc. have nothing to do with the machine code or the processor architecture, and for some operating systems the same OS-specific C code can be used on MIPS or ARM or PowerPC or x86. Where the architecture does become specific is the mechanism by which actual system calls are invoked: a specific instruction is often used, and machine code is eventually involved, yes, but there is no reason why that bit can't be coded in real or inline assembly.
And this leads to libraries: even fopen and printf, which are generic C calls, eventually have to make a system call. Much of the library support code can be written in a high-level language that is compatible across systems, but there will always be a system- and architecture-specific bit of code for the last mile. You can see this in the glibc sources, or in the hooks into newlib in other library solutions, for example.
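As a very rough, Linux-specific sketch (the syscall() wrapper and SYS_write are glibc/Linux details, chosen purely for illustration), this is roughly what that last mile below printf boils down to; on another OS the entry point and the mechanism would be different:

    /* Sketch: what printf("hello\n") eventually funnels into on Linux.
     * syscall() and SYS_write are Linux/glibc specifics; another OS would
     * use a different entry point and calling mechanism. */
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        const char msg[] = "hello\n";
        /* file descriptor 1 is standard output */
        syscall(SYS_write, 1, msg, sizeof msg - 1);
        return 0;
    }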
The same is true for other languages like C++ as it is for C. Interpreted languages have additional layers, but their virtual machines are just programs that sit on similar layers.
Low-level programming doesn't mean machine or assembly language; it just means that whatever programming language you are using accesses things at a lower level, below the application or below the operating system, etc.
Compilers produce assembly code, which is a human-readable version of machine code (e.g., instead of 1s and 0s you have actual commands). However, the correct assembly/machine code needed to make your program run correctly differs depending on the operating system. So the language the processors speak is the same, but your program needs to talk to the operating system, which is different.
For example, say you're writing a Hello World program. You need to print the phrase "Hello, World" onto the screen. Your program will need to go through the OS to actually do that, and different OSes have different interfaces.
I'm deliberately avoiding technical terms here to keep the answer understandable for beginners. To be more precise, your program needs to go through the operating system to interact with the other hardware on your computer (e.g., keyboard, display). This is done through system calls that are different for each family of OS.
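To make that concrete, here is a hedged sketch of the same "print a string" operation going through two different OS interfaces (write() and WriteFile() are the usual POSIX and Win32 calls; error handling is omitted):

    /* Same logical operation, two different OS interfaces (sketch only). */
    #include <string.h>

    #ifdef _WIN32
    #include <windows.h>
    static void say(const char *s)
    {
        DWORD written;
        /* Windows: ask the OS for the console handle, then call its API */
        WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), s, (DWORD)strlen(s),
                  &written, NULL);
    }
    #else
    #include <unistd.h>
    static void say(const char *s)
    {
        /* Unix-like systems: write() wraps the kernel's write system call */
        write(1, s, strlen(s));
    }
    #endif

    int main(void)
    {
        say("Hello, World\n");
        return 0;
    }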
The machine code that is generated can run on any processor of the same type it was generated for. The challenge is that your code will interact with other modules or programs on the system, and to do that you need conventions for calling and returning. The generated code assumes a runtime environment (OS) as well as library support (calling conventions). Those are not consistent across operating systems.
So things break when they need to transition to, and depend on, other modules using conventions defined by the operating system.
Even if the machine code instructions were identical for the compiled program on two different operating systems (not at all likely, since different operating systems provide different services in different ways), the machine code needs to be stored in a format that the host OS can use to load it into a process for execution. And those formats are frequently different between operating systems.
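As a small illustration of those container formats (just a sketch that peeks at the magic bytes, which really do differ: ELF files begin with 0x7F 'E' 'L' 'F', while Windows PE executables begin with 'M' 'Z'):

    /* Sketch: identify an executable's container format by its magic bytes. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        unsigned char magic[4] = {0};
        FILE *f;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        f = fopen(argv[1], "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fread(magic, 1, sizeof magic, f);
        fclose(f);

        if (memcmp(magic, "\x7f" "ELF", 4) == 0)
            printf("ELF image (Linux and other Unix-like systems)\n");
        else if (magic[0] == 'M' && magic[1] == 'Z')
            printf("PE/MZ image (Windows .exe)\n");
        else
            printf("unknown format\n");
        return 0;
    }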
Related
Let's say that I write a program against the Windows API and then compile it. The code is compiled to machine code for the CPU to execute. Now, my question is: if I share the executable file with someone else with another instruction set in their CPU, how can their CPU run the code the same way and not give errors or run different code?
someone else with another instruction set in their CPU
...
How can their CPU run the code the same way
The code won't run. The CPUs, simply put, speak another language.
You have two options:
Recompile your code for the target CPU (assuming you use the same source language and no platform-specific API, so you're left with C/C++ and the standard library).
Write a script / bytecode and use a runtime available for both platforms to interpret the script (or bytecode)
That's why there are runtime installations such as the JVM (for Java) and interpreters for scripting languages (Python, Scala, Lua, JavaScript, etc.), where the code ships as a script or as platform-independent bytecode.
And now - the next step. If you're using the Windows API, well, as the name suggests, it's an API (services) provided by the Windows system. So even on the same CPU, without the Windows system (e.g. on a Linux system) the application won't run. (OK, there is often a way to expose the Windows API on Linux, but it can be tricky sometimes.)
Conclusion: binaries are not portable between instruction sets, and if you're using any high-level API (Win32, ...), you're pretty much tied to the operating system too.
When high-level languages are compiled into an executable, they are often compiled to intermediate code. This is a representation of the source code that is closer to assembly language, but it is not specific to any CPU instruction set. It is up to the machine running the executable to interpret this intermediate code and run it in the CPU's native instruction set.
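To illustrate that idea with a toy (the three-instruction bytecode below is made up for this answer, not any real VM's instruction set), an interpreter is just an ordinary native program that reads CPU-independent intermediate code and carries it out on whatever processor it happens to be running on:

    /* Toy interpreter for a made-up bytecode (PUSH, ADD, PRINT, HALT).
     * Only the interpreter is native machine code; the bytecode itself
     * could be shipped unchanged to any CPU that has an interpreter. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    int main(void)
    {
        /* "compiled" program: push 2, push 3, add, print, halt */
        int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        int stack[16], sp = 0, pc = 0;

        for (;;) {
            switch (code[pc++]) {
            case OP_PUSH:  stack[sp++] = code[pc++];          break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];  break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);     break;
            case OP_HALT:  return 0;
            }
        }
    }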
I'm very intrigued by the fact that Go (since v1.5) has built-in cross-compilation options.
But how is it possible to compile for a different OS and architecture?
I mean that would require knowing (and probably behaving like) the target machine language and platform.
I mean that would require knowing (and probably behaving like) the target machine language and platform.
Yes, the Go compiler has to know how the target operating system works, but it doesn't need to behave like the target OS, as the Go compiler will not run the compiled executable binary, it just needs to produce it.
All the Go tools need to know are the binary formats of the different operating systems, plus OS and architecture details (such as the instruction set, word size, endianness, alignment, available registers, etc.). And this knowledge is built into the Go tools.
I would just like to know which languages operating systems are coded in. As far as I know, the Windows kernel is written in C, and the Linux kernel is also written in C. But what about the remaining operating systems? And in which language is the C language itself written?
Yes, the Windows kernel and Linux kernels are written in C. Most operating systems tend to be.
There are operating systems written in other languages though, the Chorus kernel for example is written in C++.
Most C compilers are also written in C. That has the advantage that once you manage to get the compiler running on the machine (generally by compiling it on another machine that already has a working compiler or cross-compiler), the machine itself can compile updates to its own compiler without maintaining yet another compiler.
Most parts of a C compiler (like gcc) are written in C themselves. Of course, you need something to bootstrap your compiler so that it can compile itself; that would then be a lower-level language such as assembler.
The C language is one of many languages that are considered to be Self Hosting - that is to say that the compiler can compile its own source code, which is written in the same language that the compiler is designed to compile.
You might also want to look into the process of Bootstrapping, which is the process used to get the first compiler for a particular language to run on a given platform - as others have noted, this can be by way of cross-compiling, or by writing the original compiler in a different language, though other techniques are possible.
First off, you might want to improve your question with actual sentences.
Second,
C is not written "in a platform"; it is written in another programming language.
The earliest compilers were written in assembler, a somewhat readable version of the actual machine code sent to the processor.
There may be compilers written in some intermediate language, but eventually everything boils down to assembler code, which assembles to machine code.
Does this introduction occur at the NTLDR stage, since it must be introduced somewhere? I mean, isn't the kernel written in C? I thought a computer's only "known-before" programming language was assembly language, hard-coded in the microcode of the processor.
The first operating systems were all written in assembly. The C language was created with UNIX as its first big use case: a C compiler was written to handle this code and produce the assembly that the system understands (and that compiler was, of course, itself written in assembly). The effect snowballs from there: with a more powerful system to write code in, we can write better compilers and better software with a more high-level approach and let the compiler do the work for us.
As far as Windows is concerned, its DOS ancestry traces back to an operating system called QDOS, which was written in assembly; the modern Windows kernel itself is written largely in C.
Sidenote: operating systems still require assembly code to function, as there are many hardware-specific operations that cannot be expressed in C (for example, reading CR2 after a page fault on x86). Bootloaders and (older) BIOSes are written in assembly because they are very specific to the hardware and are required to set up things such as interrupts and the stack pointer.
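For example (a sketch using GCC-style inline assembly; it is x86-specific, only legal in kernel mode, and shown purely to illustrate why C alone is not enough), reading the faulting address out of CR2 in a page-fault handler needs a privileged instruction that has no C equivalent:

    /* Sketch: the kind of thing a kernel must express in (inline) assembly.
     * Reading CR2 is a privileged x86 instruction, so this only works in
     * ring 0; it is shown here purely as an illustration. */
    static inline unsigned long read_cr2(void)
    {
        unsigned long addr;
        __asm__ volatile("mov %%cr2, %0" : "=r"(addr));
        return addr;  /* the linear address that caused the page fault */
    }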
C is a compiled language, as opposed to an interpreted language. C programs as well as the C runtime library are compiled into machine code, so they don't need any kind of runtime environment such as an interpreter or virtual machine to be loaded in order to execute.
The entry point of a compiled program (including a kernel) will call into its runtime library and perform any initialization required before executing the program, but this is all machine code.
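As a hedged sketch of that idea (Linux- and gcc-specific; build with gcc -nostartfiles, and note that the real crt0 is architecture-specific assembly that also sets up argc/argv, libc internals, constructors and so on before calling main):

    /* Sketch: a bare-bones stand-in for the C runtime's entry point. */
    #include <unistd.h>

    int main(void)
    {
        static const char msg[] = "hello from main\n";
        write(1, msg, sizeof msg - 1);
        return 0;
    }

    /* The OS transfers control here first, not to main. */
    void _start(void)
    {
        _exit(main());  /* run the program, then make the exit system call */
    }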
I understand that each CPU/architecture has its own instruction set, and therefore a program (binary) written for a specific CPU cannot run on another. But what I don't really understand is why an executable file (a binary like a .exe, for instance) cannot run on Linux but can run on Windows, even on the very same machine.
This is a basic question, and the answer I'm expecting is that .exe and other binary formats are probably not raw machine instructions but contain some data that is operating-system dependent. If this is true, what is this OS-dependent data like? And, as an example, what is the format of a .exe file and the difference between it and Linux executables?
Is there a source where I can get brief but detailed information about this?
In order to do something meaningful, applications will need to interface with the OS. Since system calls and user-space infrastructure look fundamentally different on Windows and Unix/Linux, having different formats for executable programs is the smallest trouble. It's the program logic that would need to be changed.
(You might argue that this is meaningless if you have a program that solely depends on standardized components, for example the C runtime library. This is theoretically true - but irrelevant for most applications since they are forced to use OS-dependent stuff).
The other differences between Windows PE (EXE, DLL, ...) files and Linux ELF binaries are related to the different image loaders and some design characteristics of both OSs. For example, on Linux a separate program is used to resolve external library imports, while this functionality is built into Windows. Another example: Linux shared libraries work differently than DLLs on Windows. Not to mention that both formats are optimized to enable the respective OS kernels to load programs as quickly as possible.
Compatibility layers like Wine try to fill the gap (and actually prove that the biggest problem is not the binary format but rather the OS interface!).
.exe and other binary formats are [definitely] not raw machine instructions but contain some data that is operating-system dependent.
what is this OS-dependent data like? And, as an example, what is the format of a .exe file and the difference between it and Linux executables?
Well, I guess Google failed you utterly. .EXE formats are very well-defined by Windows documentation.
http://support.microsoft.com/kb/65122
On Linux, the kernel and the dynamic loader map an executable into memory when you exec that file. You could read up on ld, or even on the famous a.out format.
http://linux.die.net/man/1/ld
http://en.wikipedia.org/wiki/A.out
http://en.wikipedia.org/wiki/Executable
Apart from the executable format that must be recognized by the system loader (i.e. that part of an OS that brings the executable into memory) the real problem is the interface to the OS. You can think of an OS as a kind of API that provides entry points one must call for doing specific things, like for example, writing a character to the console.
These details are usually more or less hidden from the end user, so that you can write a character to the screen with the same source code in higher-level languages. But often things differ more substantially, like, for example, the windowing environment. Not all high-level languages provide a windowing layer that abstracts over those differences.
I can't comment too much on *nix, but yes, the code part of the binary is typically happy to run in either environment; it is the OS that places certain demands on the binary. On Windows you should read up on PE headers.
The second part is simply up to the developer: many times the code will reference libraries that are OS-specific, which is why you can have both portable and non-portable C++ code before it is compiled into a binary.
A very naive answer:
Their structures are different because of different process loaders;
They use OS-dependent features like syscalls, which vary from OS to OS.
Programs need to know how to invoke operating system services. How this is done depends on the operating system: some use interrupts, some use the x86 lcall instruction, some (notably Windows) have distinguished shared libraries and don't document how to directly invoke services. Old 680x0 Macs and some other 680x0 operating systems used a reserved instruction set area and trapped the resulting "invalid CPU opcode" exception. Moreover, even when the mechanism is the same, the order and argument format of system calls differs between operating systems (and sometimes different versions of the same operating system; see stat() in the Linux kernel for an example of an interface that has changed several times).
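As a concrete sketch of how much the mechanism can differ even for one kernel on related hardware (x86-specific, GCC inline-assembly syntax, Linux call numbers: 32-bit x86 traditionally uses the int 0x80 software interrupt, while x86-64 uses the dedicated syscall instruction with different registers and a different number for the same write service):

    /* Sketch: invoking the Linux write system call "by hand" on two x86
     * variants.  Same kernel service; different instruction, registers
     * and call number. */
    #if defined(__i386__)
    static long raw_write(int fd, const void *buf, unsigned long len)
    {
        long ret;
        __asm__ volatile("int $0x80"                /* software interrupt */
                         : "=a"(ret)
                         : "a"(4), "b"(fd), "c"(buf), "d"(len)  /* 4 = write */
                         : "memory");
        return ret;
    }
    #elif defined(__x86_64__)
    static long raw_write(int fd, const void *buf, unsigned long len)
    {
        long ret;
        __asm__ volatile("syscall"                  /* dedicated instruction */
                         : "=a"(ret)
                         : "a"(1), "D"(fd), "S"(buf), "d"(len)  /* 1 = write */
                         : "rcx", "r11", "memory");
        return ret;
    }
    #endif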
There is some ability to deal with other operating systems' conventions: FreeBSD has the "linuxulator", which handles the Linux-specific kernel interface; NetBSD similarly has emulators for the system-call formats of other operating systems using the same hardware (say, Ultrix on MIPS or OSF/1 on Alpha); Linux used to have iBCS2 to handle the UnixWare/SCO Unix kernel interface; and Wine provides replacement shared libraries and a binary loader for PE-style Windows executables. (I don't recall if Wine also supports OS/2-style LX .exes; it probably does handle original-format MZ .exes; and then there's .com, which is essentially a raw memory image with no real header at all.) Even so, there is always some format that uses different conventions, and sometimes the conventions are similar enough that the OS needs hints as to how to deal with a binary. (See brandelf on FreeBSD, for example.)