Assembly language and compiled languages - performance

How is assembly faster than compiled languages if both are translated to machine code?
I'm talking about truly compiled languages which are translated to machine code. Not C# or Java which are compiled to an intermediate language first and then compiled to native code by a software interpreter, etc.
On Wikipedia, I found something which I'm not sure if it's in any way related to this. Is it because that translation from a higher level language generates extra machine code? Or is my understanding wrong?
A utility program called an assembler is used to translate assembly language statements into the target computer's machine code. The assembler performs a more or less isomorphic translation (a one-to-one mapping) from mnemonic statements into machine instructions and data. This is in contrast with high-level languages, in which a single statement generally results in many machine instructions.

Well, it relates a bit to your question, indeed. The point is that compilers produce inefficient machine code at times for various reasons, such as not being able to completely analyze your code, inserting automatic range checks, automatic checks for objects being null, etc.
On the other hand if you write assembler code by hand and know what you're doing, then you can probably write some things much more efficient than the compiler, although the compiler's behavior may be tweaked and you can usually tell it not to do range checking, for example.
Most people, however, will not write better assembler code than a compiler, simply because compilers are written by people who know a good deal of really weird but really cool optimizations. Also things like loop unrolling are usually a pain to write yourself and make the resulting code faster in many cases.
While it's generally true that everything that a computer executes is machine code, the code that runs differs greatly depending on how many abstraction levels you put between the machine and the programmer. For Assembler that's one level, for Java there are a few more ...
Also many people mistakenly believe that certain optimizations at a higher abstraction layer pay off at a lower one. This is not necessarily the case and the compiler may just have trouble understanding what you are trying to do and fail to properly optimize it.

Assembly may sometimes be faster than a compiled language if an assembly programmer writes better assembly than that generated by the compiler.
A compiled language is often faster than assembly because programmers who write compilers usually know the CPU architecture better than programmers who are utilizing assembly in a one-off, limited-case, situation.

An assembly expert may be able to write assembly code that is more effective (fewer instructions, more efficient instructions, SIMD, ...) than what a compiler generates automatically.
However, most of the time, you're better off trusting the optimizer of your compiler.
Learn what your compiler does. Then let the compiler do it.

My standard answer when questions about assembly vs. high-level come up is to take a look at Michael Abrash's Graphics Programming Black Book.
The first couple of chapters give a good idea of what you can optimise effectively using assembly, and what you can't.
You can download it from GameDev - Jeff's links seem to be broken now unfortunately.

All good answers. My only additional point is that programmers tend to write a certain number of lines of code per day, regardless of language. Since the advantage of a high-level language is that it lets you get more done with less code, it takes incredible programmer discipline to actually write less code.
This is especially an issue for performance because it matters almost nowhere except in a tiny part of the code. It only matters in your hotspots - code that you write (1) consuming a significant fraction of execution time (2) without calling functions (3).

First of all, compilers generate very good (fast) assembly code.
It's true that compilers can add extra code since high order languages have mechanisms, like virtual methods and exceptions in C++. Thus the compiler will have to produce more code. There are cases where raw assembly could speed up the code but that's rare nowdays.

First - assembler should be used only in small code pieces, which eat most of the CPU time in a program - some kind of calculations for example - in the "bottle neck" of algorithm.
Secondly - it depends on experience in ASM of those who implements the same code in Assembler. If the assembler implementation of "bottle neck" code will be faster. If experience is low - it will be slower. And it will contain a lot of bugs. If experience is high enough - ASM will give significant profit.

How is assembly faster than compiled languages if both are translated to machine code?
The implicit assumption is hand-written assembly code. Of course, most compilers (e.g. GCC for C, C++, Fortran, Go, D etc...) are generating some assembler code; for example you might compile your foo.cc C++ source code with g++ -fverbose-asm -Wall -S -O2 -march=native foo.cc and look into the generated foo.s assembler code.
However, efficient assembler code is so difficult to write that, today, compilers can optimize better than human do. See this.
So practically speaking, it is not worth coding in assembler (also, take into account that development efforts cost very often much more than the hardware running the compiled code). Even when performance matters a lot and is worth spending a lot of money, it is better to hand-code only very few routines in assembler, or even to embed some assembler code in some of your C routines.
Look into the CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”

Related

Difference between compiled and interpreted languages?

What are the relative strengths and weaknesses of compiled and interpreted languages?
Neither approach has a clear advantage over the other - if one approach was always better, chances are that we'd start using it everywhere!
Generally speaking, compilers offer the following advantages:
Because they can see all the code up-front, they can perform a number of analyses and optimizations when generating code that makes the final version of the code executed faster than just interpreting each line individually.
Compilers can often generate low-level code that performs the equivalent of a high-level ideas like "dynamic dispatch" or "inheritance" in terms of memory lookups inside of tables. This means that the resulting programs need to remember less information about the original code, lowering the memory usage of the generated program.
Compiled code is generally faster than interpreted code because the instructions executed are usually just for the program itself, rather than the program itself plus the overhead from an interpreter.
Generally speaking, compilers have the following drawbacks:
Some language features, such as dynamic typing, are difficult to compile efficiently because the compiler can't predict what's going to happen until the program is actually run. This means that the compiler might not generate very good code.
Compilers generally have a long "start-up" time because of the cost of doing all the analysis that they do. This means that in settings like web browsers where it's important to load code fast, compilers might be slower because they optimize short code that won't be run many times.
Generally speaking, interpreters have the following advantages:
Because they can read the code as written and don't have to do expensive operations to generate or optimize code, they tend to start up faster than compilers.
Because interpreters can see what the program does as its running, interpreters can use a number of dynamic optimizations that compilers might not be able to see.
Generally speaking, interpreters have the following disadvantages:
Interpreters typically have higher memory usage than compilers because the interpreter needs to keep more information about the program available at runtime.
Interpreters typically spend some CPU time inside of the code for the interpreter, which can slow down the program being run.
Because interpreters and compilers have complementary strengths and weaknesses, it's becoming increasingly common for language runtimes to combine elements of both. Java's JVM is a good example of this - the Java code itself is compiled, and initially it's interpreted. The JVM can then find code that's run many, many times and compile it directly to machine code, meaning that "hot" code gets the benefits of compilation while "cold" code does not. The JVM can also perform a number of dynamic optimizations like inline caching to speed up performance in ways that compilers typically don't.
Many modern JavaScript implementations use similar tricks. Most JavaScript code is short and doesn't do all that much, so they typically start off using an interpreter. However, if it becomes clear that the code is being run repeatedly, many JS engines will compile the code - or at least, compile bits and pieces of it - and optimize it using standard techniques. The net result is that the code is fast at startup (useful for loading web pages quickly) but gets faster the more that it runs.
One last detail is that languages are not compiled or interpreted. Usually, C code is compiled, but there are C interpreters available that make it easier to debug or visualize the code that's being run (they're often used in introductory programming classes - or at least, they used to be.) JavaScript used to be thought of as an interpreted language until some JS engines started compiling it. Some Python implementations are purely interpreters, but you can get Python compilers that generate native code. Now, some languages are easier to compile or interpret than others, but there's nothing stopping you from making a compiler or interpreter for any particular programming language. There's a theoretical result called the Futamura projections that shows that anything that can be interpreted can be compiled, for example.
Because the Start up time it’s cheaper and you can read interpreters program as is makes it a no brainer to me for the interpreter

Unified assembly language

I wonder if there exists some kind of universal and easy-to-code opcode (or assembly) language which provides basic set of instructions available in most of today's CPUs (not some fancy CISC, register-only computer, just common one). With possibility to "compile", micro-optimize and "interpret" on any mentioned CPUs?
I'm thinking about something like MARS MIPS simulator (rather simple and easy to read code), with possibility to make real programs. No libraries necessary (but nice thing if that possible), just to make things (libraries or UNIX-like tools) faster in uniform way.
Sorry if that's silly question, I'm new to assembler. I just don't find NASM or UNIX assembly language neither extremely cross-platform nor easy to read and code.
The JVM bytecode is sort of like assembly language, but without pointer arithmetic. However, it's quite object-oriented. On the positive side, it's totally cross-platform.
You might want to look at LLVM bytecode - but bear in mind this warning: http://llvm.org/docs/FAQ.html#can-i-compile-c-or-c-code-to-platform-independent-llvm-bitcode
First thing: writing in Assembly does not guarantee a speed increase. Using the correct algorithm for the job at hand has the greatest impact on speed. By the time you need to go down to Assembly to squeeze the last few drops out you can only really do that by adapting the algorithm to the specific architecture of the hardware in question. A generic HLA (High Level Assembler) pretty much defeats the purpose of writing your code in Assembly. Note that I am not knocking Randall Hyde’s HLA, which is a great product, I’m just saying that you don’t gain anything from writing Assembly the way a compiler generates machine code. Most C and C++ compilers have very good optimizers, and can produce machine code superior to almost any naïve implementation in ASM.
See if you can find these books (2nd hand, they are out of print) by Michael Abrash: "Zen of Assembly Language", and "Zen of Code Optimization". Or look if you can find his articles on DDJ. They will give you an insight into optimization second to none,
Related stuff, so I hope might be useful :
There is
flat assembler
with an approach of a kind of portable assembler.
Interesting project of operating system with graphical user interface written in assembler, and great assembly API :
Menuet OS
LLVM IR provides quite portable assembly, backed with powerful compiler, backing many projects including Clang

How does Go compile so quickly?

I've Googled and poked around the Go website, but I can't find an explanation for Go's extraordinary build times. Are they products of the language features (or lack thereof), a highly optimized compiler, or something else? I'm not trying to promote Go; I'm just curious.
Dependency analysis.
The Go FAQ used to contain the following sentence:
Go provides a model for software
construction that makes dependency
analysis easy and avoids much of the
overhead of C-style include files and
libraries.
While the phrase is not in the FAQ anymore, this topic is elaborated upon in the talk Go at Google, which compares the dependency analysis approach of C/C++ and Go.
That is the main reason for fast compilation. And this is by design.
I think it's not that Go compilers are fast, it's that other compilers are slow.
C and C++ compilers have to parse enormous amounts of headers - for example, compiling C++ "hello world" requires compiling 18k lines of code, which is almost half a megabyte of sources!
$ cpp hello.cpp | wc
18364 40513 433334
Java and C# compilers run in a VM, which means that before they can compile anything, the operating system has to load the whole VM, then they have to be JIT-compiled from bytecode to native code, all of which takes some time.
Speed of compilation depends on several factors.
Some languages are designed to be compiled fast. For example, Pascal was designed to be compiled using a single-pass compiler.
Compilers itself can be optimized too. For example, the Turbo Pascal compiler was written in hand-optimized assembler, which, combined with the language design, resulted in a really fast compiler working on 286-class hardware. I think that even now, modern Pascal compilers (e.g. FreePascal) are faster than Go compilers.
There are multiple reasons why the Go compiler is much faster than most C/C++ compilers:
Top reason: Most C/C++ compilers exhibit exceptionally bad designs (from compilation speed perspective). Also, from compilation speed perspective, some parts of the C/C++ ecosystem (such as editors in which programmers are writing their code) aren't designed with speed-of-compilation in mind.
Top reason: Fast compilation speed was a conscious choice in the Go compiler and also in the Go language
The Go compiler has a simpler optimizer than C/C++ compilers
Unlike C++, Go has no templates and no inline functions. This means that Go doesn't need to perform any template or function instantiation.
The Go compiler generates low-level assembly code sooner and the optimizer works on the assembly code, while in a typical C/C++ compiler the optimization passes work on an internal representation of the original source code. The extra overhead in the C/C++ compiler comes from the fact that the internal representation needs to be generated.
Final linking (5l/6l/8l) of a Go program can be slower than linking a C/C++ program, because the Go compiler is going through all of the used assembly code and maybe it is also doing other extra actions that C/C++ linkers aren't doing
Some C/C++ compilers (GCC) generate instructions in text form (to be passed to the assembler), while the Go compiler generates instructions in binary form. Extra work (but not much) needs to be done in order to transform the text into binary.
The Go compiler targets only a small number of CPU architectures, while the GCC compiler targets a large number of CPUs
Compilers which were designed with the goal of high compilation speed, such as Jikes, are fast. On a 2GHz CPU, Jikes can compile 20000+ lines of Java code per second (and the incremental mode of compilation is even more efficient).
Compilation efficiency was a major design goal:
Finally, it is intended to be fast: it should take at most a few seconds to build a large executable on a single computer. To meet these goals required addressing a number of linguistic issues: an expressive but lightweight type system; concurrency and garbage collection; rigid dependency specification; and so on. FAQ
The language FAQ is pretty interesting in regards to specific language features relating to parsing:
Second, the language has been designed to be easy to analyze and can be parsed without a symbol table.
While most of the above is true, there is one very important point that was not really mentionend: Dependency management.
Go only needs to include the packages that you are importing directly (as those already imported what they need). This is in stark contrast to C/C++, where every single file starts including x headers, which include y headers etc. Bottom line: Go's compiling takes linear time w.r.t to the number of imported packages, where C/C++ take exponential time.
A good test for the translation efficiency of a compiler is self-compilation: how long does it take a given compiler to compile itself? For C++ it takes a very long time (hours?). By comparison, a Pascal/Modula-2/Oberon compiler would compile itself in less than one second on a modern machine [1].
Go has been inspired by these languages, but some of the main reasons for this efficiency include:
A clearly defined syntax that is mathematically sound, for efficient scanning and parsing.
A type-safe and statically-compiled language that uses separate compilation with dependency and type checking across module boundaries, to avoid unnecessary re-reading of header files and re-compiling of other modules - as opposed to independent compilation like in C/C++ where no such cross-module checks are performed by the compiler (hence the need to re-read all those header files over and over again, even for a simple one-line "hello world" program).
An efficient compiler implementation (e.g. single-pass, recursive-descent top-down parsing) - which of course is greatly helped by points 1 and 2 above.
These principles have already been known and fully implemented in the 1970s and 1980s in languages like Mesa, Ada, Modula-2/Oberon and several others, and are only now (in the 2010s) finding their way into modern languages like Go (Google), Swift (Apple), C# (Microsoft) and several others.
Let's hope that this will soon be the norm and not the exception. To get there, two things need to happen:
First, software platform providers such as Google, Microsoft and Apple should start by encouraging application developers to use the new compilation methodology, while enabling them to re-use their existing code base. This is what Apple is now trying to do with the Swift programming language, which can co-exist with Objective-C (since it uses the same runtime environment).
Second, the underlying software platforms themselves should eventually be re-written over time using these principles, while simultaneously redesigning the module hierarchy in the process to make them less monolithic. This is of course a mammoth task and may well take the better part of a decade (if they are courageous enough to actually do it - which I am not at all sure in the case of Google).
In any case, it's the platform that drives language adoption, and not the other way around.
References:
[1] http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.System.pdf, page 6: "The compiler compiles itself in about 3 seconds". This quote is for a low cost Xilinx Spartan-3 FPGA development board running at a clock frequency of 25 MHz and featuring 1 MByte of main memory. From this one can easily extrapolate to "less than 1 second" for a modern processor running at a clock frequency well above 1 GHz and several GBytes of main memory (i.e. several orders of magnitude more powerful than the Xilinx Spartan-3 FPGA board), even when taking I/O speeds into account. Already back in 1990 when Oberon was run on a 25MHz NS32X32 processor with 2-4 MBytes of main memory, the compiler compiled itself in just a few seconds. The notion of actually waiting for the compiler to finish a compilation cycle was completely unknown to Oberon programmers even back then. For typical programs, it always took more time to remove the finger from the mouse button that triggered the compile command than to wait for the compiler to complete the compilation just triggered. It was truly instant gratification, with near-zero wait times. And the quality of the produced code, even though not always completely on par with the best compilers available back then, was remarkably good for most tasks and quite acceptable in general.
Go was designed to be fast, and it shows.
Dependency Management: no header file, you just need to look at the packages that are directly imported (no need to worry about what they import) thus you have linear dependencies.
Grammar: the grammar of the language is simple, thus easily parsed. Although the number of features is reduced, thus the compiler code itself is tight (few paths).
No overload allowed: you see a symbol, you know which method it refers to.
It's trivially possible to compile Go in parallel because each package can be compiled independently.
Note that Go isn't the only language with such features (modules are the norm in modern languages), but they did it well.
Quoting from the book "The Go Programming Language" by Alan Donovan and Brian Kernighan:
Go compilation is notably faster than most other compiled languages, even when building from scratch. There are three main reasons for the compiler’s speed. First, all imports must be explicitly listed at the beginning of each source file, so the compiler does not have to read and process an entire file to determine its dependencies. Second, the dependencies of a package form a directed acyclic graph, and because there are no cycles, packages can be compiled separately and perhaps in parallel. Finally, the object file for a compiled Go package records export information not just for the package itself, but for its dependencies too. When compiling a package, the compiler must read one object file for each import but need not look beyond these files.
The basic idea of compilation is actually very simple. A recursive-descent parser, in principle, can run at I/O bound speed. Code generation is basically a very simple process. A symbol table and basic type system is not something that requires a lot of computation.
However, it is not hard to slow down a compiler.
If there is a preprocessor phase, with multi-level include directives, macro definitions, and conditional compilation, as useful as those things are, it is not hard to load it down. (For one example, I'm thinking of the Windows and MFC header files.) That is why precompiled headers are necessary.
In terms of optimizing the generated code, there is no limit to how much processing can be added to that phase.
Simply ( in my own words ), because the syntax is very easy ( to analyze and to parse )
For instance, no type inheritance means, not problematic analysis to find out if the new type follows the rules imposed by the base type.
For instance in this code example: "interfaces" the compiler doesn't go and check if the intended type implement the given interface while analyzing that type. Only until it's used ( and IF it is used ) the check is performed.
Other example, the compiler tells you if you're declaring a variable and not using it ( or if you are supposed to hold a return value and you're not )
The following doesn't compile:
package main
func main() {
var a int
a = 0
}
notused.go:3: a declared and not used
This kinds of enforcements and principles make the resulting code safer, and the compiler doesn't have to perform extra validations that the programmer can do.
At large all these details make a language easier to parse which result in fast compilations.
Again, in my own words.
Go imports dependencies once for all files, so the import time doesn't increase exponentially with project size.
Simpler linguistics means interpreting them takes less computing.
What else?

GCC: Inline assembly - good for?

So I just found out GCC could do inline assembly and I was wondering two things:
What's the benefit of being able to inline assembly?
Is it possible to use GCC as an assembly compiler/assembler to learn assembly?
I've found a couple articles but they are all oldish, 2000 and 2001, not really sure of their relevance.
Thanks
The benefit of inline assembly is to have the assembly code, inlined (wait wait, don't kill me). By doing this, you don't have to worry about calling conventions, and you have much more control of the final object file (meaning you can decide where each variable goes- to which register or if it's memory stored), because that code won't be optimized (assuming you use the volatile keyword).
Regarding your second question, yes, it's possible. What you can do is write simple C programs, and then translate them to assembly, using
gcc -S source.c
With this, and the architecture manuals (MIPS, Intel, etc) as well as the GCC manual, you can go a long way.
There's some material online.
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/
The downside of inline assembly, is that usually your code will not be portable between different compilers.
Hope it helps.
Inline Assembly is useful for in-place optimizations, and access to CPU features not exposed by any libraries or the operating system.
For example, some applications need strict tracking of timing. On x86 systems, the RDTSC assembly command can be used to read the internal CPU timer.
Time Stamp Counter - Wikipedia
Using GCC or any C/C++ compiler with inline assembly is useful for small snippets of code, but many environments do not have good debugging support- which will be more important when developing projects where inline assembly provides specific functionality. Also, portability will become a recurring issue if you use inline assembly. It is preferable to create specific items in a suitable environment (GNU assembler, MASM) and import them projects as needed.
Inline assembly is generally used to access hardware features not otherwise exposed by the compiler (e.g. vector SIMD instructions where no intrinsics are provided), and/or for hand-optimizing performance critical sections of code where the compiler generates suboptimal code.
Certainly there is nothing to stop you using the inline assembler to test routines you have written in assembly language; however, if you intend to write large sections of code you are better off using a real assembler to avoid getting bogged down with irrelevancies. You will likely find the GNU assembler got installed along with the rest of the toolchain ;)
The benefit of embedding custom assembly code is that sometimes (dare I say, often times) a developer can write more efficient assembly code than a compiler can. So for extremely performance intensive items, custom written assembly might be beneficial. Games tend to come to mind....
As far as using it to learn assembly, I have no doubt that you could. But, I imagine that using an actual assembly SDK might be a better choice. Aside from the standard experimentation of learning how to use the language, you'd probably want the knowledge around setting up a development environment.
You should not learn assembly language by using the inline asm feature.
Regarding what it's good for, I agree with jldupont, mostly obfuscation. In theory, it allows you to easily integrate with the compiler, because the complex syntax of extended asm allows you to cooperate with the compiler on register usage, and it allows you to tell the compiler that you want this and that to be loaded from memory and placed in registers for you, and finally, it allows the compiler to be warned that you have clobbered this register or that one.
However, all of that could have been done by simply writing standard-conforming C code and then writing an assembler module, and calling the extension as a normal function. Perhaps ages ago the procedure call machine op was too slow to tolerate, but you won't notice today.
I believe the real answer is that it is easier, once you know the contraint DSL. People just throw in an asm and obfuscate the C program rather than go to the trouble of modifying the Makefile and adding a new module to the build and deploy workflow.
This isn't really an answer, but kind of an extended comment on other peoples' answers.
Inline assembly is still used to access CPU features. For instance, in the ARM chips used in cell phones, different manufacturers distinguish their offerings via special features that require unusual machine language instructions that would have no equivalent in C/C++.
Back in the 80s and early 90s, I used inline assembly a lot for optimizing loops. For instance, C compilers targeting 680x0 processors back then would do really stupid things, like:
calculate a value and put it in data register D1
PUSH D1, A7 # Put the value from D1 onto the stack in RAM
POP D1, A7 # Pop it back off again
do something else with the value in D1
But I haven't needed to do that in, oh, probably fifteen years, because modern compilers are much smarter. In fact, current compilers will sometimes generate more efficient code than most humans would. Especially given CPUs with long pipelines, branch prediction, and so on, the fastest-executing sequence of instructions is not always the one that would make most sense to a human. So you can say, "Do A B C D in that order", and the compiler will scramble the order all around for greater efficiency.
Playing a little with inline assembly is fine for starters, but if you're serious, I echo those who suggest you move to a "real" assembler after a while.
Manual optimization of loops that are executed a lot. This article is old, but can give you an idea about the kinds of optimizations hand-coded assembly is used for.
You can also use the assembler gcc uses directly. It's called as (see man as). However, many books and articles on assembly assume you are using a DOS or Windows environment. So it might be kind of hard to learn on Linux (maybe running FreeDOS on a virtual machine), because you not only need to know the processor (you can usually download the official manuals) you code for but also how hook to into the OS you are running.
A nice beginner book using DOS is the one by Norton and Socha. It's pretty old (the 3rd and latest edition is from 1992), so you can get used copies for like $0.01 (no joke). The only book I know of that is specific to Linux is the free "Programming from the Ground Up"

Fortran's performance

Fortran's performances on Computer Language Benchmark Game are surprisingly bad. Today's result puts Fortran 14th and 11th on the two quad-core tests, 7th and 10th on the single cores.
Now, I know benchmarks are never perfect, but still, Fortran was (is?) often considered THE language for high performance computing and it seems like the type of problems used in this benchmark should be to Fortran's advantage. In an recent article on computational physics, Landau (2008) wrote:
However, [Java] is not as efficient or
as well supported for HPC and parallel
processing as are FORTRAN and C, the
latter two having highly developed
compilers and many more scientific
subroutine libraries available.
FORTRAN, in turn, is still the
dominant language for HPC, with
FORTRAN 90/95 being a surprisingly
nice, modern, and effective language;
but alas, it is hardly taught by any
CS departments, and compilers can be
expensive.
Is it only because of the compiler used by the language shootout (Intel's free compiler for Linux) ?
No, this isn't just because of the compiler.
What benchmarks like this -- where the program differs from benchmark to benchmark -- is largely the amount of effort (and quality of effort) that the programmer put into writing any given program. I suspect that Fortran is at a significant disadvantage in that particular metric -- unlike C and C++, the pool of programmers who'd want to try their hand at making the benchmark program better is pretty small, and unlike most anything else, they likely don't feel like they have something to prove either. So, there's no motivation for someone to spend a few days poring over generated assembly code and profiling the program to make it go faster.
This is fairly clear from the results that were obtained. In general, with sufficient programming effort and a decent compiler, neither C, C++, nor Fortran will be significantly slower than assembly code -- certainly not more than 5-10%, at worst, except for pathological cases. The fact that the actual results obtained here are more variant than that indicates to me that "sufficient programming effort" has not been expended.
There are exceptions when you allow the assembly to use vector instructions, but don't allow the C/C++/Fortran to use corresponding compiler intrinsics -- automatic vectorization is not even a close approximation of perfect and probably never will be. I don't know how much those are likely to apply here.
Similarly, an exception is in things like string handling, where you depend heavily on the runtime library (which may be of varying quality; Fortran is rarely a case where a fast string library will make money for the compiler vendor!), and on the basic definition of a "string" and how that's represented in memory.
Some random thoughts:
Fortran used to do very well because it was easier to identify loop invariants which made some optimizations easier for the compiler. Since then
Compilers have gotten much more sophisticated. Enormous effort has been put into c and c++ compilers in particular. Have the fortran compilers kept up? I suppose the gfortran uses the same back end of gcc and g++, but what of the intel compiler? It used to be good, but is it still?
Some languages have gotten a lot specialized keywords and syntax to help the compiler (restricted and const int const *p in c, and inline in c++). Not knowing fortran 90 or 95 I can't say if these have kept pace.
I've looked at these tests. It's not like the compiler is wrong or something. In most tests Fortran is comparable to C++ except some where it gets beaten by a factor of 10. These tests just reflect what one should know from the beggining - that Fortran is simply NOT an all-around interoperable programming language - it is suited for efficient computation, has good list operations & stuff but for example IO sucks unless you are doing it with specific Fortran-like methods - like e.g. 'unformatted' IO.
Let me give you an example - the 'reverse-complement' program that is supposed to read a large (of order of 10^8 B) file from stdin line-by-line, does something with it & prints the resulting large file to stdout. The pretty straighforward Fortran program is about 10 times slower on a single core (~10s) than a HEAVILY optimized C++ (~1s). When you try to play with the program, you'll see that only simple formatted read & write take more than 8 seconds. In a Fortran way, if you care for efficiency, you'd just write an unformatted structure to a file & read it back in no time (which is totally non-portable & stuff but who cares anyway - an efficient code is supposed to be fast & optimized for a specific machine, not able to run everywhere).
So the short answer is - don't worry, just do your job - and if you want to write a super-efficient operating system, than sorry - Fortran is just not the way for that kind of performance.
This benchmark is stupid at all.
For example, they measure CPU-time for the whole program to run. As mcmint stated (and it might be actually true) Fortran I/O sucks*. But who cares? In real-world tasks one read input for some seconds than do calculations for hours/days/months and finally write output for the seconds. Thats why in most benchmarks I/O operations are excluded from time measurements (if you of course do not benchmark I/O by itself).
Norber Wiener in his book God & Golem, Inc. wrote
Render unto man the things which are man’s and unto the computer the things which are the computer’s.
In my opinion the usage of this principle while implementing algorithm in any programming language means:
Write as readable and simple code as you can and let compiler do the optimizations.
Especially it is important in real-world (huge) applications. Dirty tricks (so heavily used in many benchmarks) even if they might improve the efficiency to some extent (5%, maybe 10%) are not for the real-world projects.
/* C/C++ uses stream I/O, but Fortran traditionally uses record-based I/O. Further reading. Anyway I/O in that benchmarks are so surprising. The usage of stdin/stdout redirection might also be the source of problem. Why not simply use the ability of reading/writing files provided by the language or standard library? Once again this woud be more real-world situation.
I would like to say that even if the benchmark do not bring up the best results for FORTRAN, this language will still be used and for a long time. Reasons of use are not just performance but also some kind of thing called easyness of programmability. Lots of people that learnt to use it in the 60's and 70's are now too old for getting into new stuff and they know how to use FORTRAN pretty well. I mean, there are a lot of human factors for a language to be used. The programmer also matters.
Considering they did not publish the exact compiler options they used for the Intel Fortran Compiler, I have little faith in their benchmark.
I would also remark that both Intel's math library, MKL, and AMD's math library, ACML, use the Intel Fortran Compiler.
Edit:
I did find the compilation options when you click on the benchmark's name. The result is surprising since the optimization level seems reasonable. It may come down to the efficiency of the algorithm.

Resources