ARM softfp vs hardfp performance

I have an ARM-based platform running a Linux OS. Although its gcc-based toolchain supports both hardfp and softfp, the vendor recommends using softfp, and the platform ships with a set of standard and platform-specific libraries that are available only as softfp builds.
I'm writing computation-intensive (NEON) AI code based on OpenCV and TensorFlow Lite. Following the vendor's guide, I built these with the softfp option. However, I have the feeling that my code underperforms compared to otherwise similar hardfp platforms.
Does code performance depend on the softfp/hardfp setting? Am I right that all the .o and .a files the compiler produces to build my program also use the softfp convention, which is less efficient? If so, are there any tricks to use the hardfp calling convention internally while keeping softfp for the external libraries?

Normally, all objects that are linked together need to use the same float ABI. So if you need to use this softfp-only library, I'm afraid you have to compile your own software with softfp too.
I had the same question about mixing ABIs. See here
Regarding the performance: the performance lost with softfp compared to hardfp comes from passing floating-point function arguments through general-purpose registers instead of FPU registers. This requires some additional copying between registers. As old_timer said, it is impossible to evaluate the performance loss in general. If you have a single huge function with many float operations, the performance will be the same. If you have many small function calls with many floating-point arguments and few operations, the performance will be dramatically slower.

The softfp option only affects parameter passing.
In other words, unless you are passing lots of float-type arguments when calling functions, there won't be any measurable performance hit compared to hardfp.
And since well-designed projects rely heavily on passing pointers to structures instead of many individual values, I would stick with softfp.
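For illustration, a minimal sketch (hypothetical function and type names) of why that advice helps: with many scalar float arguments, softfp has to shuffle each one through general-purpose registers on every call, while a single struct pointer sidesteps the calling-convention difference almost entirely.

// With softfp, each of these float arguments is passed in general-purpose
// registers and copied back into VFP/NEON registers inside the callee.
float scale6(float a, float b, float c, float d, float e, float f) {
    return a * b + c * d + e * f;
}

// Passing one pointer instead avoids the per-argument shuffling, so the
// softfp-vs-hardfp difference for this call is negligible.
struct Params { float a, b, c, d, e, f; };
float scale(const Params* p) {
    return p->a * p->b + p->c * p->d + p->e * p->f;
}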

Related

Do compilers usually emit vector (SIMD) instructions when not explicitly told to do so?

C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
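For reference, a minimal self-contained C++17 sketch of that call (assuming a standard library that actually implements the <execution> header; on some toolchains you may also need to link an extra threading library such as TBB):

#include <algorithm>
#include <execution>
#include <random>
#include <vector>

int main() {
    std::vector<int> v(1000);
    std::mt19937 rng(42);
    for (auto& x : v) x = static_cast<int>(rng());

    // The execution policy permits (but does not require) the implementation
    // to use multiple threads and SIMD instructions for the sort.
    std::sort(std::execution::par_unseq, v.begin(), v.end());
}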
I noticed that Microsoft's experimental implementation mentions over here that the VC++ compiler lacks support for vectorization, which surprises me - I thought that modern C++ compilers are able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even when explicitly told to do so. This seeming lack of automatic vectorization support contradicts the answers to this 2011 question on Quora, which suggest that compilers will vectorize where possible.
Maybe compilers will only vectorize very obvious cases such as a std::array<int, 4> and nothing more, in which case C++17's explicit parallelization would be useful.
Hence my question: do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow it down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: do compilers for other languages do better automatic vectorization (perhaps due to language design), such that the C++ standards committee found explicit (C++17-style) vectorization necessary?
The best compiler for automatically spotting SIMD style vectorisation (when told it can generate opcodes for the appropriate instruction sets of course) is the Intel compiler in my experience (which can generate code to do dynamic dispatch depending on the actual CPU if required), closely followed by GCC and Clang, and MSVC last (of your four).
This is perhaps unsurprising I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel, and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also rightly point out that their compiler lets you use pragma simd constructs to tell it which assumptions can or can't be made (things that are unclear at a purely syntactic level), and hence lets it vectorise the code further without resorting to intrinsics.
This, I think, points at the issue with hoping that the compiler (for C++ or another language) will do all the vectorisation work... if you have simple vector processing loops (e.g. multiply all the elements in a vector by a scalar) then yes, you could expect that 3 of the 4 compilers would spot that.
But for more complicated code, the vectorisation gains come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that's going to be hard if not impossible for a compiler to do completely on its own. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to allow the compiler to see the opportunities to do so, perhaps with pragma simd constructs or OpenMP, then you may get the results you want.
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this - put your C++ code in there and look at what the different compilers actually generate? Very handy... it doesn't include older versions of MSVC (I think it currently supports VC++ 2017 and later) but will show you what different versions of ICC, GCC, Clang and others do with your code...
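As a concrete starting point, here is a minimal loop of the "multiply every element by a scalar" kind mentioned above; GCC, Clang, and ICC will typically vectorise it at -O3, and you can paste it into Compiler Explorer and look for packed instructions such as mulps/vmulps in the output (the function name is just for illustration):

#include <cstddef>

// A loop most optimizing compilers can auto-vectorize: unit-stride accesses,
// a simple trip count, and __restrict to rule out aliasing between the arrays.
void scale(float* __restrict out, const float* __restrict in,
           float factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}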

How can a compiler be cross-platform (hardware)?

I just realized that compilers convert source code into binaries for the destination platform. Kind of obvious... but if a compiler works that way, how can the same compiler be used for different systems like x86, ARM, MIPS, etc.?
Aren't they supposed to "know" the machine language of the target hardware platform in order to build the binary? Does a compiler (like gcc) know the machine language of every single platform it supports?
How is that possible, and how can a compiler be optimized for that many platforms at the same time?
Yes, they have to "know" the machine language for every single platform they support. This is required to generate machine code. However, compilation is a multi-step process, and the first steps of the compilation are common to most architectures.
Taken from Wikipedia ("Structure of a compiler"):
Compilers bridge source programs in high-level languages with the underlying hardware. A compiler requires determining the correctness of the syntax of programs, generating correct and efficient object code, run-time organization, and formatting output according to assembler and/or linker conventions. A compiler consists of three main parts: the frontend, the middle-end, and the backend.
The front end checks whether the program is correctly written in terms of the programming language syntax and semantics. Here legal and illegal programs are recognized. Errors are reported, if any, in a useful way. Type checking is also performed by collecting type information. The frontend then generates an intermediate representation or IR of the source code for processing by the middle-end.
The middle end is where optimization takes place. Typical transformations for optimization are removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), or specialization of computation based on the context. The middle-end generates another IR for the following backend. Most optimization efforts are focused on this part.
The back end is responsible for translating the IR from the middle-end into assembly code. The target instruction(s) are chosen for each IR instruction. Register allocation assigns processor registers for the program variables where possible. The backend utilizes the hardware by figuring out how to keep parallel execution units busy, filling delay slots, and so on. Although most algorithms for optimization are in NP, heuristic techniques are well-developed.
More in this article, which describes the structure of a compiler, and in this one, which deals with cross compilers.
The http://llvm.org/ project will answer all of your questions in this regard :)
In a nutshell, cross-hardware compilers emit an "intermediate representation" of the code, which is hardware-agnostic; it is then lowered for the target via the native toolchain.
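As a small illustration (assuming a clang installation with the relevant back ends available; the file and function names are arbitrary), the same translation unit can be lowered for different targets, and only the back end output differs:

// square.cpp - the front end and middle end process this identically
// regardless of the target; only the back end emits different instructions.
//
//   clang++ -O2 -emit-llvm -S square.cpp -o square.ll                  (mostly target-independent IR)
//   clang++ -O2 --target=x86_64-linux-gnu -S square.cpp -o square.x86.s
//   clang++ -O2 --target=armv7a-linux-gnueabihf -S square.cpp -o square.arm.s
int square(int x) { return x * x; }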
Yes, it is possible; this is called a cross compiler. The compiler first generates object code that is not runnable on the machine doing the compiling but targets the destination machine; that object code is then linked against the external libraries of the target machine.
TL;DR: yes, the compiler knows the target's machine code, but you can compile on different hardware.
I recommend reading the attached links for more information.
Every platform has its own toolchain; a toolchain includes gcc, gdb, ld, nm, etc.
Let's take the specific example of gcc. The GCC source code has many layers, including architecture-dependent and architecture-independent parts. The architecture-dependent part contains the procedures that handle architecture-specific things such as the stack, function calls, and floating-point operations. We need to build the gcc sources as a cross compiler for a specific architecture such as ARM. You can see the steps here for reference: http://www.ailis.de/~k/archives/19-arm-cross-compiling-howto.html#toolchain.
This architecture-dependent part is responsible for handling machine-language operations.

How does Go compile so quickly?

I've Googled and poked around the Go website, but I can't find an explanation for Go's extraordinary build times. Are they products of the language features (or lack thereof), a highly optimized compiler, or something else? I'm not trying to promote Go; I'm just curious.
Dependency analysis.
The Go FAQ used to contain the following sentence:
Go provides a model for software construction that makes dependency analysis easy and avoids much of the overhead of C-style include files and libraries.
While the phrase is not in the FAQ anymore, this topic is elaborated upon in the talk Go at Google, which compares the dependency analysis approach of C/C++ and Go.
That is the main reason for fast compilation. And this is by design.
I think it's not that Go compilers are fast, it's that other compilers are slow.
C and C++ compilers have to parse enormous amounts of headers - for example, compiling C++ "hello world" requires compiling 18k lines of code, which is almost half a megabyte of sources!
$ cpp hello.cpp | wc
18364 40513 433334
Java and C# compilers run in a VM, which means that before they can compile anything, the operating system has to load the whole VM, then they have to be JIT-compiled from bytecode to native code, all of which takes some time.
Speed of compilation depends on several factors.
Some languages are designed to be compiled fast. For example, Pascal was designed to be compiled using a single-pass compiler.
Compilers themselves can be optimized too. For example, the Turbo Pascal compiler was written in hand-optimized assembler, which, combined with the language design, resulted in a really fast compiler running on 286-class hardware. I think that even now, modern Pascal compilers (e.g. FreePascal) are faster than Go compilers.
There are multiple reasons why the Go compiler is much faster than most C/C++ compilers:
Top reason: Most C/C++ compilers exhibit exceptionally bad designs from a compilation-speed perspective. Also, some parts of the C/C++ ecosystem (such as the editors in which programmers write their code) aren't designed with speed of compilation in mind.
Top reason: Fast compilation speed was a conscious choice in the Go compiler and also in the Go language
The Go compiler has a simpler optimizer than C/C++ compilers
Unlike C++, Go has no templates and no inline functions. This means that Go doesn't need to perform any template or function instantiation.
The Go compiler generates low-level assembly code sooner and the optimizer works on the assembly code, while in a typical C/C++ compiler the optimization passes work on an internal representation of the original source code. The extra overhead in the C/C++ compiler comes from the fact that the internal representation needs to be generated.
Final linking (5l/6l/8l) of a Go program can be slower than linking a C/C++ program, because the Go compiler is going through all of the used assembly code and maybe it is also doing other extra actions that C/C++ linkers aren't doing
Some C/C++ compilers (GCC) generate instructions in text form (to be passed to the assembler), while the Go compiler generates instructions in binary form. Extra work (but not much) needs to be done in order to transform the text into binary.
The Go compiler targets only a small number of CPU architectures, while the GCC compiler targets a large number of CPUs
Compilers which were designed with the goal of high compilation speed, such as Jikes, are fast. On a 2GHz CPU, Jikes can compile 20000+ lines of Java code per second (and the incremental mode of compilation is even more efficient).
Compilation efficiency was a major design goal:
Finally, it is intended to be fast: it should take at most a few seconds to build a large executable on a single computer. To meet these goals required addressing a number of linguistic issues: an expressive but lightweight type system; concurrency and garbage collection; rigid dependency specification; and so on. FAQ
The language FAQ is pretty interesting in regards to specific language features relating to parsing:
Second, the language has been designed to be easy to analyze and can be parsed without a symbol table.
While most of the above is true, there is one very important point that was not really mentioned: dependency management.
Go only needs to include the packages that you are importing directly (as those have already imported what they need). This is in stark contrast to C/C++, where every single file starts by including x headers, which include y headers, and so on. Bottom line: Go's compile time is linear w.r.t. the number of imported packages, whereas C/C++'s grows exponentially.
A good test for the translation efficiency of a compiler is self-compilation: how long does it take a given compiler to compile itself? For C++ it takes a very long time (hours?). By comparison, a Pascal/Modula-2/Oberon compiler would compile itself in less than one second on a modern machine [1].
Go has been inspired by these languages, but some of the main reasons for this efficiency include:
A clearly defined syntax that is mathematically sound, for efficient scanning and parsing.
A type-safe and statically-compiled language that uses separate compilation with dependency and type checking across module boundaries, to avoid unnecessary re-reading of header files and re-compiling of other modules - as opposed to independent compilation like in C/C++ where no such cross-module checks are performed by the compiler (hence the need to re-read all those header files over and over again, even for a simple one-line "hello world" program).
An efficient compiler implementation (e.g. single-pass, recursive-descent top-down parsing) - which of course is greatly helped by points 1 and 2 above.
These principles have already been known and fully implemented in the 1970s and 1980s in languages like Mesa, Ada, Modula-2/Oberon and several others, and are only now (in the 2010s) finding their way into modern languages like Go (Google), Swift (Apple), C# (Microsoft) and several others.
Let's hope that this will soon be the norm and not the exception. To get there, two things need to happen:
First, software platform providers such as Google, Microsoft and Apple should start by encouraging application developers to use the new compilation methodology, while enabling them to re-use their existing code base. This is what Apple is now trying to do with the Swift programming language, which can co-exist with Objective-C (since it uses the same runtime environment).
Second, the underlying software platforms themselves should eventually be re-written over time using these principles, while simultaneously redesigning the module hierarchy in the process to make them less monolithic. This is of course a mammoth task and may well take the better part of a decade (if they are courageous enough to actually do it - which I am not at all sure in the case of Google).
In any case, it's the platform that drives language adoption, and not the other way around.
References:
[1] http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.System.pdf, page 6: "The compiler compiles itself in about 3 seconds". This quote is for a low cost Xilinx Spartan-3 FPGA development board running at a clock frequency of 25 MHz and featuring 1 MByte of main memory. From this one can easily extrapolate to "less than 1 second" for a modern processor running at a clock frequency well above 1 GHz and several GBytes of main memory (i.e. several orders of magnitude more powerful than the Xilinx Spartan-3 FPGA board), even when taking I/O speeds into account. Already back in 1990 when Oberon was run on a 25MHz NS32X32 processor with 2-4 MBytes of main memory, the compiler compiled itself in just a few seconds. The notion of actually waiting for the compiler to finish a compilation cycle was completely unknown to Oberon programmers even back then. For typical programs, it always took more time to remove the finger from the mouse button that triggered the compile command than to wait for the compiler to complete the compilation just triggered. It was truly instant gratification, with near-zero wait times. And the quality of the produced code, even though not always completely on par with the best compilers available back then, was remarkably good for most tasks and quite acceptable in general.
Go was designed to be fast, and it shows.
Dependency Management: no header file, you just need to look at the packages that are directly imported (no need to worry about what they import) thus you have linear dependencies.
Grammar: the grammar of the language is simple and thus easily parsed. The number of features is also reduced, so the compiler code itself is tight (few code paths).
No overload allowed: you see a symbol, you know which method it refers to.
It's trivially possible to compile Go in parallel because each package can be compiled independently.
Note that Go isn't the only language with such features (modules are the norm in modern languages), but they did it well.
Quoting from the book "The Go Programming Language" by Alan Donovan and Brian Kernighan:
Go compilation is notably faster than most other compiled languages, even when building from scratch. There are three main reasons for the compiler’s speed. First, all imports must be explicitly listed at the beginning of each source file, so the compiler does not have to read and process an entire file to determine its dependencies. Second, the dependencies of a package form a directed acyclic graph, and because there are no cycles, packages can be compiled separately and perhaps in parallel. Finally, the object file for a compiled Go package records export information not just for the package itself, but for its dependencies too. When compiling a package, the compiler must read one object file for each import but need not look beyond these files.
The basic idea of compilation is actually very simple. A recursive-descent parser, in principle, can run at I/O bound speed. Code generation is basically a very simple process. A symbol table and basic type system is not something that requires a lot of computation.
However, it is not hard to slow down a compiler.
If there is a preprocessor phase, with multi-level include directives, macro definitions, and conditional compilation, as useful as those things are, it is not hard to load it down. (For one example, I'm thinking of the Windows and MFC header files.) That is why precompiled headers are necessary.
In terms of optimizing the generated code, there is no limit to how much processing can be added to that phase.
Simply (in my own words): because the syntax is very easy to analyze and to parse.
For instance, having no type inheritance means there is no problematic analysis to find out whether a new type follows the rules imposed by a base type.
For instance, in this code example with interfaces, the compiler doesn't go and check whether the intended type implements the given interface while analyzing that type; only when (and IF) it is actually used is the check performed.
Another example: the compiler tells you if you declare a variable and don't use it (or if you are supposed to hold a return value and you don't).
The following doesn't compile:
package main
func main() {
    var a int
    a = 0
}
notused.go:3: a declared and not used
These kinds of enforcements and principles make the resulting code safer, and the compiler doesn't have to perform extra validations that the programmer can do.
By and large, all these details make the language easier to parse, which results in fast compilation.
Again, in my own words.
Go imports dependencies once for all files, so the import time doesn't increase exponentially with project size.
A simpler language means that interpreting it takes less computing.
What else?

GCC: Inline assembly - good for?

So I just found out that GCC can do inline assembly, and I am wondering two things:
What's the benefit of being able to inline assembly?
Is it possible to use GCC as an assembly compiler/assembler to learn assembly?
I've found a couple of articles, but they are all oldish (2000 and 2001), so I'm not really sure how relevant they still are.
Thanks
The benefit of inline assembly is to have the assembly code, inlined (wait wait, don't kill me). By doing this, you don't have to worry about calling conventions, and you have much more control over the final object file (meaning you can decide where each variable goes - to which register, or whether it is stored in memory), because that code won't be optimized away (assuming you use the volatile keyword).
Regarding your second question, yes, it's possible. What you can do is write simple C programs, and then translate them to assembly, using
gcc -S source.c
With this, and the architecture manuals (MIPS, Intel, etc) as well as the GCC manual, you can go a long way.
There's some material online.
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/
The downside of inline assembly is that usually your code will not be portable between different compilers.
Hope it helps.
Inline Assembly is useful for in-place optimizations, and access to CPU features not exposed by any libraries or the operating system.
For example, some applications need strict tracking of timing. On x86 systems, the RDTSC instruction can be used to read the internal CPU timer.
Time Stamp Counter - Wikipedia
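For instance, a minimal GCC-style sketch of reading that counter with extended inline asm (x86 only; the function name is illustrative, and treat it as an example rather than a robust timing primitive):

#include <cstdint>

// RDTSC places the 64-bit time-stamp counter in EDX:EAX; the "=a" and "=d"
// output constraints tell GCC which registers the results live in.
static inline std::uint64_t read_tsc() {
    std::uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}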
Using GCC or any C/C++ compiler with inline assembly is useful for small snippets of code, but many environments do not have good debugging support for it, which will matter more when developing projects where inline assembly provides specific functionality. Also, portability becomes a recurring issue if you use inline assembly. It is preferable to create specific items in a suitable environment (GNU assembler, MASM) and import them into projects as needed.
Inline assembly is generally used to access hardware features not otherwise exposed by the compiler (e.g. vector SIMD instructions where no intrinsics are provided), and/or for hand-optimizing performance critical sections of code where the compiler generates suboptimal code.
Certainly there is nothing to stop you using the inline assembler to test routines you have written in assembly language; however, if you intend to write large sections of code you are better off using a real assembler to avoid getting bogged down with irrelevancies. You will likely find the GNU assembler got installed along with the rest of the toolchain ;)
The benefit of embedding custom assembly code is that sometimes (dare I say, often times) a developer can write more efficient assembly code than a compiler can. So for extremely performance intensive items, custom written assembly might be beneficial. Games tend to come to mind....
As far as using it to learn assembly, I have no doubt that you could. But, I imagine that using an actual assembly SDK might be a better choice. Aside from the standard experimentation of learning how to use the language, you'd probably want the knowledge around setting up a development environment.
You should not learn assembly language by using the inline asm feature.
Regarding what it's good for, I agree with jldupont, mostly obfuscation. In theory, it allows you to easily integrate with the compiler, because the complex syntax of extended asm allows you to cooperate with the compiler on register usage, and it allows you to tell the compiler that you want this and that to be loaded from memory and placed in registers for you, and finally, it allows the compiler to be warned that you have clobbered this register or that one.
However, all of that could have been done by simply writing standard-conforming C code and then writing an assembler module, and calling the extension as a normal function. Perhaps ages ago the procedure call machine op was too slow to tolerate, but you won't notice today.
I believe the real answer is that it is easier, once you know the constraint DSL. People just throw in an asm and obfuscate the C program rather than go to the trouble of modifying the Makefile and adding a new module to the build and deploy workflow.
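To make that constraint DSL concrete, here is a small sketch of the cooperation described above - inputs and outputs bound to registers the compiler chooses, plus an explicit clobber list (a hypothetical x86 example, not taken from the answers above):

// Add two integers in x86 assembly, letting GCC pick the registers.
// "=r" asks for the output in any general-purpose register, "0" ties the
// first input to the same register as the output, "r" puts the second input
// in a register, and "cc" warns the compiler that the flags are clobbered.
int add_asm(int a, int b) {
    int result;
    asm("addl %2, %0"
        : "=r"(result)
        : "0"(a), "r"(b)
        : "cc");
    return result;
}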
This isn't really an answer, but kind of an extended comment on other peoples' answers.
Inline assembly is still used to access CPU features. For instance, in the ARM chips used in cell phones, different manufacturers distinguish their offerings via special features that require unusual machine language instructions that would have no equivalent in C/C++.
Back in the 80s and early 90s, I used inline assembly a lot for optimizing loops. For instance, C compilers targeting 680x0 processors back then would do really stupid things, like:
calculate a value and put it in data register D1
PUSH D1, A7 # Put the value from D1 onto the stack in RAM
POP D1, A7 # Pop it back off again
do something else with the value in D1
But I haven't needed to do that in, oh, probably fifteen years, because modern compilers are much smarter. In fact, current compilers will sometimes generate more efficient code than most humans would. Especially given CPUs with long pipelines, branch prediction, and so on, the fastest-executing sequence of instructions is not always the one that would make most sense to a human. So you can say, "Do A B C D in that order", and the compiler will scramble the order all around for greater efficiency.
Playing a little with inline assembly is fine for starters, but if you're serious, I echo those who suggest you move to a "real" assembler after a while.
Manual optimization of loops that are executed a lot. This article is old, but can give you an idea about the kinds of optimizations hand-coded assembly is used for.
You can also use the assembler gcc uses directly. It's called as (see man as). However, many books and articles on assembly assume you are using a DOS or Windows environment. So it might be kind of hard to learn on Linux (maybe try running FreeDOS in a virtual machine), because you not only need to know the processor you are coding for (you can usually download the official manuals) but also how to hook into the OS you are running.
A nice beginner book using DOS is the one by Norton and Socha. It's pretty old (the 3rd and latest edition is from 1992), so you can get used copies for like $0.01 (no joke). The only book I know of that is specific to Linux is the free "Programming from the Ground Up"

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (e.g. HADDPD, horizontal add packed double, in SSE3). These require a certain register layout that needs to be either deliberately set up or produced by the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly meant for hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
Generally, few compilers use them. GCC and Visual Studio aren't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, they will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but it didn't work last time I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general, though, you'll have to use them yourself, either in raw assembler or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they better allow the compiler to understand the code, and so schedule and optimize it, but in practice I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment and see what works. But don't expect the compiler to use these instructions for you unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.
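For reference, a minimal sketch of the intrinsics route mentioned above (plain SSE, so it compiles with any of the major x86 compilers; the function name is just for illustration):

#include <xmmintrin.h>  // SSE intrinsics

// Add four pairs of floats with a single packed instruction (ADDPS).
// The compiler still handles register allocation and scheduling around it.
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}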
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler
I have seen gcc use SSE to zero out a default std::string object. Not a particularly powerful use of SSE, but it exists. In most cases, though, you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!
If you use the Vector Pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage, basically anything of length less than 64 bits (for 64-bit reals it is actually slower to use SIMD).
The latest versions of the compiler will also automatically parallelise across cores.
