suggestion on implementing high performance assembly instructions generator

suggestion on implementing high performance assembly instructions generator - performance

I'm trying write a assembly instruction generator for one operation of a math library. To achieve high performance some operations are needed wrote in assembly language, and vectorization is also necessary. The final goal is to implement a cross-platform generator to automatically generate high performance assembly implementation to save time. For now the abstraction of computation operations is manually, I'm wonder if there are some more automatical way to abstract this operations. Maybe some techs of front-end of compiling will be applicable?

Related

Java vs. Julia: Differences in JIT Compiling and Resulting Performance

I've recently started reading about JIT compilation. On another note, I've read that well-written Julia code often performs on-par with statically compiled languages (see, e.g., paragraph 2 of the introduction section of the Julia docs) while I've recurrently heard Java often does not. Why is that?
On the surface, they seem to have in common that they both run JIT-compiled bytecode in a VM (although I am aware that Java dynamically infers which code to JIT). While I can rationalize the performance difference in Julia vs. (purely) interpreted languages like (vanilla) Python, how come two JIT-compiled languages have such different reputations for performance? Speaking of performance, I am particularly referring to scientific computing applications.
Please note that this question is intentionally phrased broadly. I feel like its possible answers could give me insights into what defines fast Julia code, given the way Julia's compiler works in comparison to other JIT compiled languages.

While AFAIK there is currently one implementation of Julia, there are several implementations of Java and not all behave the same nor use the same technics internally. Thus it does not mean much to compare languages. For example, GCJ is a GNU compiler meant to compile Java codes to native ones (ie. no JIT nor bytecode). It is now a defunct project since the open-source JIT-based implementations super-seeded this project (AFAIK even performance-wise).
The primary reference Java VM implementation is HotSpot (made by Oracle). The JIT of HotSpot use an adaptative strategy for compiling functions so to reduce the latency of the compilation. The code can be interpreted for a short period of time and if it is executed many times, then the JIT use more aggressive optimizations with multiple levels. As a result hot loops are very aggressively optimized while glue code executed once is mostly interpreted. Meanwhile, Julia is based on the LLVM compiler stack capable of producing very efficient code (it is used by Clang which is a compiler used to compile C/C++ code to native one), but it is also not yet very well suited for very dynamic codes (it works but the latency is pretty big compared to other existing JIT implementations).
The thing is Java and Julia target different domains. Java is used for example on embedded systems where latency matters a lot. It is also use for GUI applications and Web servers. Introducing a high latency during the execution is not very reasonable. This is especially why Java implementation spent a lot of time in the past so to optimize the GC (Garbage Collector) in order to reduce the latency of collections. Julia mainly target HPC/scientific applications that do not care much about latency. The main goal of Julia is to minimize the wall-clock time and not the responsiveness of the application.
I've read that well-written Julia code often performs on-par with statically compiled languages
Well, optimizing JITs like the one of Julia or the one of HotSpot are very good nowadays to compile scalar codes in hot loops. Their weakness lies in the capability to apply high-level expensive computations. For example, optimizing compilers like ICC/PGI can use the polyhedral model so to completely rewrite loops and vectorize them efficiently using SIMD instructions. This is frequent in HPC (numerically intensive) applications but very rare in embedded/Web/GUI ones. The use of the best specific instructions on the available platform is not always great in most JIT implementations (eg. bit operations) though the situation is rapidly improving. On the other hand, JIT can outperform static compilers by using runtime informations. For example, they can assume a value is a constant and optimize expressions based on that (eg. a runtime-dependent stride of 1 of a multi-dimensional array do not need additional multiplications). Still, static compilers can do similar optimisation with profile-guided optimizations (unfortunately rarely used in practice).
However, there is a catch: languages likes C/C++ compiled natively have access to lower-level features barely available in Java. This is a bit different in Julia since the link with native language code is easier and inline assembly is possible (as pointed out by #OscarSmith) enabling skilled developers to write efficient wrappers. Julia and Java use a GC that can speed up a bit some unoptimized codes but also slow down a lot some others (typically code manipulating big data-structures with a lot of references likes trees and graphs, especially in parallel codes). This is why a C/C++ code can significantly outperform a Julia/Java code. While JIT implementations can sometime (but rarely) outperform static C/C++ compilers, no compilers are perfect and there are case where nearly all implementations perform poorly. The use of (low-level) intrinsics enable developers to implement possibly faster codes at the expense of a lower portability and a higher complexity. This is not rare for SIMD code since auto-vectorization is far from being great so far. More generally, the access to lower-level features (eg. operating system specific functions, parallel tools) help to write faster codes for skilled programmers.
Chosen algorithms and methods matters often far more than the target language implementation. The best algorithm/method in one language implementation may not be the best in another. Two best algorithms/methods of two different implementation are generally hard to compare (it is fair to compare only the performance of codes if one is is nearly impossible to maintain and is very hard/long to write without bugs?). This is partially why comparing language implementation is so difficult, even on a specific problem to solve.
(purely) interpreted languages like (vanilla) Python
While the standard implementation of Python is the CPython interpreter, there are fast JIT for Python like PyPy or Pyston.
Speaking of performance, I am particularly referring to scientific computing applications
Note that scientific computing applications is still quite broad. While physicist tends to write heavily numerically intensive applications operating on large contiguous arrays where the use of multiple threads and SIMD instruction is critical, biologist tends to write codes requiring very different optimizations. For example, genomic codes tends to do a lot of string matching operations. They also often make use of complex data-structures/algorithms (eg. phylogenetic tree, compression).
Some Java features like boxing are performance killers for such applications though there are often complex way to mitigate their cost.
You may be interested by this famous language benchmark:
Julia VS C-GCC (one can see that Julia and Java are slow for binary trees, as expected, certainly due to the GC, though the Java's GC is more efficient at the expense of a bigger memory usage)
Julia VS Java-OpenJDK
C-GCC VS C-Clang
As you can see in the benchmark, the fastest implementations are generally the more-complex and/or bigger ones using the best algorithms and lower-level methods/tricks.

How can I find out the time required to perform of a particular instruction for Xtensa microprocessor for e.g. wsr / rsr?

I am trying to optimize a code on esp32 which uses xtensa LX6 microprocessors , I wanted to know the cost of wsr and rsr instructions which are used to read or wirte in the special registers .

First of all, optimize only after you have profiled, and came to the conclusion that this is your bottleneck.
In rare cases (like a function that accesses registers) it might be a good idea to optimize code generated by the compiler, but usually, that is not where the bottleneck is.
In general, when optimizing compiler-generated code:
write a very simple function (that produces the asm you think you can optimize)
create TESTS for this function (people tend to introduce weird bugs when writing their own asm)
run & measure
replace the generated asm with your optimizations
test you haven't screwed up the logic
run & measure again
Even if you managed to optimize, think about readability, portability, maintenance, etc. before choosing your optimized version.

Is There a Way of Providing asm.js or WebAssembly Code to V8 Turbofan?

After looking into the recently announce support for WebAssembly, it occurs to me that it would dramatically increase its utility if there were some way to:
Have TurboFan, the successor to the V8 JIT Crankshaft optimizer output all the assembly code it generates along with the static type signatures, and execution profile of that generated code.
Permit the programmer to provide his own asm.js/WebAssembly code for specific static type signatures that override the optimizer.
Is there some way to do this already?
There is some indication that it may be from the following passage from this article:
Under the hood, the WebAssembly implementation in V8 is designed to
reuse much of the existing JavaScript virtual machine infrastructure,
specifically the TurboFan compiler. A specialized WebAssembly decoder
validates modules by checking types, local variable indices, function
references, return values, and control flow structure in a single
pass. The decoder produces a TurboFan graph which is processed by
various optimization passes and finally turned into machine code by
the same backend which generates machine code for optimized JavaScript
and asm.js. In the next few months, the team will concentrate on
improving the startup time of the V8 implementation through compiler
tuning, parallelism, and compilation policy improvements.
To expand on the idea for a more general audience:
Typical top-down optimization involves high level programming and then execution profiling to identify which pieces of code require more effort. This is true whether the optimization is automated code generation or manual coding of optimized code. In the case of dynamically typed languages you'll frequently want to go beyond just optimizing dynamically-typed algorithms and provide code specialized for specific static types. This is, in fact, what the V8 JIT optimizer does automatically. If humans want to manually provide some particularly 'hot' specialized cases, they'd need to inform the automated optimizer, somehow, that they have already done the work so the automated optimizer can incorporate the manually optimized code rather than automatically generating suboptimal code.

No, that's not possible, and it's highly unlikely that it ever will be, given that it would probably require piercing all sorts of abstraction barriers within the system. The complexity would be enormous, and the effect on maintainability and security would probably be severe.
The web interface to WebAssembly modules (through the Wasm object) provides a clean and simple way to interface between JS and Wasm. In the future, ES6 modules might simplify interop further. It's not obvious what advantage a complicated mechanism like you propose would have over that.

For 1. you can play with the following flags:
trace_turbo: trace generated TurboFan IR
trace_turbo_graph: trace generated TurboFan graphs
trace_turbo_cfg_file: trace turbo cfg graph (for C1 visualizer) to a given file name
trace_turbo_types: trace TurboFan's types
trace_turbo_scheduler: trace TurboFan's scheduler
trace_turbo_reduction: trace TurboFan's various reducers
trace_turbo_jt: trace TurboFan's jump threading
trace_turbo_ceq: trace TurboFan's control equivalence
turbo_stats: print TurboFan statistics
They may change in future versions of V8 and aren't a stable API.
TurboFan is pretty complicated in that it consumes information from the baseline JIT / the interpreter, and may get to that information after deopt. The compiler isn't always a straight pipeline from JS / wasm to assembly. Inlining and a bunch of other things affect what happens.
For 2.: write wasm code or valid asm.js in the first place.
We've discussed performing a bunch of different types of dynamic tracing, caching traces (and allowing injection of traces for testing), but that's probably not something we'd expose considering that there's already a way to give the compiler precise type information!

Assembly Analysis Tools

Does anyone have any suggestions for assembly file analysis tools? I'm attempting to analyze ARM/Thumb-2 ASM files generated by LLVM (or alternatively GCC) when passed the -S option. I'm particularly interested in instruction statistics at the basic block level, e.g. memory operation counts, etc. I may wind up rolling my own tool in Python, but was curious to see if there were any existing tools before I started.
Update: I've done a little searching, and found a good resource for disassembly tools / hex editors / etc here, but unfortunately it is mainly focused on x86 assembly, and also doesn't include any actual assembly file analyzers.

What you need is a tool for which you can define an assembly language syntax, and then build custom analyzers. You analyzers might be simple ("how much space does an instruction take?") or complex ("How many cycles will this isntruction take to execute?" [which depends on the preceding sequence of instructions and possibly a sophisticated model of the processor you care about]).
One designed specifically to do that is the New Jersey Machine Toolkit. It is really designed to build code generators and debuggers. I suspect it would be good at "instruction byte count". It isn't clear it is good at more sophisticated analyses. And I believe it insists you follow its syntax style, rather than yours.
One not designed specifically to do that, but good at parsing/analyzing langauges in general is our
DMS Software Reengineering Toolkit.
DMS can be given a grammar description for virtually any context free language (that covers most assembly language syntax) and can then parse a specific instance of that grammar (assembly code) into ASTs for further processing. We've done with with several assembly langauges, including the IBM 370, Motorola's 8 bit CPU line, and a rather peculiar DSP, without trouble.
You can specify an attribute grammar (computation over an AST) to DMS easily. These are great way to encode analyses that need just local information, such as "How big is this instruction?". For more complex analysese, you'll need a processor model that is driven from a series of instructions; passing such a machine model the ASTs for individual instructions would be an easy way to apply a machine model to compute more complex things as "How long does this instruction take?".
Other analyses such as control flow and data flow, are provided in generic form by DMS. You can use an attribute evaluator to collect local facts ("control-next for this instruction is...", "data from this instruction flows to,...") and feed them to the flow analyzers to compute global flow facts ("if I execute this instruction, what other instructions might be executed downstream?"..)
You do have to configure DMS for your particular (assembly) language. It is designed to be configured for tasks like these.
Yes, you can likely code all this in Python; after all, its a Turing machine. But likely not nearly as easily.
An additional benefit: DMS is willing to apply transformations to your code, based on your analyses. So you could implement your optimizer with it, too. After all, you need to connect the analysis indication the optimization is safe, to the actual optimization steps.

I have written many disassemblers, including arm and thumb. Not production quality but for the purposes of learning the assembler. For both the ARM and Thumb the ARM ARM (ARM Architectural Reference Manual) has a nice chart from which you can easily count up data operations from load/store, etc. maybe an hours worth of work, maybe two. At least up front, you would end up with data values being counted though.
The other poster may be right, as with the chart I am talking about it should be very simple to write a program to examine the ASCII looking for ldr, str, add, etc. No need to parse everything if you are interested in memory operations counts, etc. Of course the downside is that you are likely not going to be able to examine loops. One function may have a load and store, another may have a load and store but have it wrapped by a loop, causing many more memory operations once executed.
Not knowing what you really are interested in, my guess is you might want to simulate the code and count these sorts of things. I wrote a thumb simulator (thumbulator) that attempts to do just that. (and I have used it to compare llvm execution vs gcc execution when it comes to number of instructions executed, fetches, memory operations, etc) The problem may be that it is thumb only, no ARM no Thumb2. Thumb2 could be added easier than ARM. There exists an armulator from arm, which is in the gdb sources among other places. I cant remember now if it executes thumb2. My understanding is that when arm was using it would accurately tell you these sorts of statistics.

You can plug your statistics into LLVM code generator, it's quite flexible and it is already collecting some stats, which could be used as an example.

How does Go compile so quickly?

I've Googled and poked around the Go website, but I can't find an explanation for Go's extraordinary build times. Are they products of the language features (or lack thereof), a highly optimized compiler, or something else? I'm not trying to promote Go; I'm just curious.

Dependency analysis.
The Go FAQ used to contain the following sentence:
Go provides a model for software
construction that makes dependency
analysis easy and avoids much of the
overhead of C-style include files and
libraries.
While the phrase is not in the FAQ anymore, this topic is elaborated upon in the talk Go at Google, which compares the dependency analysis approach of C/C++ and Go.
That is the main reason for fast compilation. And this is by design.

I think it's not that Go compilers are fast, it's that other compilers are slow.
C and C++ compilers have to parse enormous amounts of headers - for example, compiling C++ "hello world" requires compiling 18k lines of code, which is almost half a megabyte of sources!
$ cpp hello.cpp | wc
18364 40513 433334
Java and C# compilers run in a VM, which means that before they can compile anything, the operating system has to load the whole VM, then they have to be JIT-compiled from bytecode to native code, all of which takes some time.
Speed of compilation depends on several factors.
Some languages are designed to be compiled fast. For example, Pascal was designed to be compiled using a single-pass compiler.
Compilers itself can be optimized too. For example, the Turbo Pascal compiler was written in hand-optimized assembler, which, combined with the language design, resulted in a really fast compiler working on 286-class hardware. I think that even now, modern Pascal compilers (e.g. FreePascal) are faster than Go compilers.

There are multiple reasons why the Go compiler is much faster than most C/C++ compilers:
Top reason: Most C/C++ compilers exhibit exceptionally bad designs (from compilation speed perspective). Also, from compilation speed perspective, some parts of the C/C++ ecosystem (such as editors in which programmers are writing their code) aren't designed with speed-of-compilation in mind.
Top reason: Fast compilation speed was a conscious choice in the Go compiler and also in the Go language
The Go compiler has a simpler optimizer than C/C++ compilers
Unlike C++, Go has no templates and no inline functions. This means that Go doesn't need to perform any template or function instantiation.
The Go compiler generates low-level assembly code sooner and the optimizer works on the assembly code, while in a typical C/C++ compiler the optimization passes work on an internal representation of the original source code. The extra overhead in the C/C++ compiler comes from the fact that the internal representation needs to be generated.
Final linking (5l/6l/8l) of a Go program can be slower than linking a C/C++ program, because the Go compiler is going through all of the used assembly code and maybe it is also doing other extra actions that C/C++ linkers aren't doing
Some C/C++ compilers (GCC) generate instructions in text form (to be passed to the assembler), while the Go compiler generates instructions in binary form. Extra work (but not much) needs to be done in order to transform the text into binary.
The Go compiler targets only a small number of CPU architectures, while the GCC compiler targets a large number of CPUs
Compilers which were designed with the goal of high compilation speed, such as Jikes, are fast. On a 2GHz CPU, Jikes can compile 20000+ lines of Java code per second (and the incremental mode of compilation is even more efficient).

Compilation efficiency was a major design goal:
Finally, it is intended to be fast: it should take at most a few seconds to build a large executable on a single computer. To meet these goals required addressing a number of linguistic issues: an expressive but lightweight type system; concurrency and garbage collection; rigid dependency specification; and so on. FAQ
The language FAQ is pretty interesting in regards to specific language features relating to parsing:
Second, the language has been designed to be easy to analyze and can be parsed without a symbol table.

While most of the above is true, there is one very important point that was not really mentionend: Dependency management.
Go only needs to include the packages that you are importing directly (as those already imported what they need). This is in stark contrast to C/C++, where every single file starts including x headers, which include y headers etc. Bottom line: Go's compiling takes linear time w.r.t to the number of imported packages, where C/C++ take exponential time.

A good test for the translation efficiency of a compiler is self-compilation: how long does it take a given compiler to compile itself? For C++ it takes a very long time (hours?). By comparison, a Pascal/Modula-2/Oberon compiler would compile itself in less than one second on a modern machine [1].
Go has been inspired by these languages, but some of the main reasons for this efficiency include:
A clearly defined syntax that is mathematically sound, for efficient scanning and parsing.
A type-safe and statically-compiled language that uses separate compilation with dependency and type checking across module boundaries, to avoid unnecessary re-reading of header files and re-compiling of other modules - as opposed to independent compilation like in C/C++ where no such cross-module checks are performed by the compiler (hence the need to re-read all those header files over and over again, even for a simple one-line "hello world" program).
An efficient compiler implementation (e.g. single-pass, recursive-descent top-down parsing) - which of course is greatly helped by points 1 and 2 above.
These principles have already been known and fully implemented in the 1970s and 1980s in languages like Mesa, Ada, Modula-2/Oberon and several others, and are only now (in the 2010s) finding their way into modern languages like Go (Google), Swift (Apple), C# (Microsoft) and several others.
Let's hope that this will soon be the norm and not the exception. To get there, two things need to happen:
First, software platform providers such as Google, Microsoft and Apple should start by encouraging application developers to use the new compilation methodology, while enabling them to re-use their existing code base. This is what Apple is now trying to do with the Swift programming language, which can co-exist with Objective-C (since it uses the same runtime environment).
Second, the underlying software platforms themselves should eventually be re-written over time using these principles, while simultaneously redesigning the module hierarchy in the process to make them less monolithic. This is of course a mammoth task and may well take the better part of a decade (if they are courageous enough to actually do it - which I am not at all sure in the case of Google).
In any case, it's the platform that drives language adoption, and not the other way around.
References:
[1] http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.System.pdf, page 6: "The compiler compiles itself in about 3 seconds". This quote is for a low cost Xilinx Spartan-3 FPGA development board running at a clock frequency of 25 MHz and featuring 1 MByte of main memory. From this one can easily extrapolate to "less than 1 second" for a modern processor running at a clock frequency well above 1 GHz and several GBytes of main memory (i.e. several orders of magnitude more powerful than the Xilinx Spartan-3 FPGA board), even when taking I/O speeds into account. Already back in 1990 when Oberon was run on a 25MHz NS32X32 processor with 2-4 MBytes of main memory, the compiler compiled itself in just a few seconds. The notion of actually waiting for the compiler to finish a compilation cycle was completely unknown to Oberon programmers even back then. For typical programs, it always took more time to remove the finger from the mouse button that triggered the compile command than to wait for the compiler to complete the compilation just triggered. It was truly instant gratification, with near-zero wait times. And the quality of the produced code, even though not always completely on par with the best compilers available back then, was remarkably good for most tasks and quite acceptable in general.

Go was designed to be fast, and it shows.
Dependency Management: no header file, you just need to look at the packages that are directly imported (no need to worry about what they import) thus you have linear dependencies.
Grammar: the grammar of the language is simple, thus easily parsed. Although the number of features is reduced, thus the compiler code itself is tight (few paths).
No overload allowed: you see a symbol, you know which method it refers to.
It's trivially possible to compile Go in parallel because each package can be compiled independently.
Note that Go isn't the only language with such features (modules are the norm in modern languages), but they did it well.

Quoting from the book "The Go Programming Language" by Alan Donovan and Brian Kernighan:
Go compilation is notably faster than most other compiled languages, even when building from scratch. There are three main reasons for the compiler’s speed. First, all imports must be explicitly listed at the beginning of each source file, so the compiler does not have to read and process an entire file to determine its dependencies. Second, the dependencies of a package form a directed acyclic graph, and because there are no cycles, packages can be compiled separately and perhaps in parallel. Finally, the object file for a compiled Go package records export information not just for the package itself, but for its dependencies too. When compiling a package, the compiler must read one object file for each import but need not look beyond these files.

The basic idea of compilation is actually very simple. A recursive-descent parser, in principle, can run at I/O bound speed. Code generation is basically a very simple process. A symbol table and basic type system is not something that requires a lot of computation.
However, it is not hard to slow down a compiler.
If there is a preprocessor phase, with multi-level include directives, macro definitions, and conditional compilation, as useful as those things are, it is not hard to load it down. (For one example, I'm thinking of the Windows and MFC header files.) That is why precompiled headers are necessary.
In terms of optimizing the generated code, there is no limit to how much processing can be added to that phase.

Simply ( in my own words ), because the syntax is very easy ( to analyze and to parse )
For instance, no type inheritance means, not problematic analysis to find out if the new type follows the rules imposed by the base type.
For instance in this code example: "interfaces" the compiler doesn't go and check if the intended type implement the given interface while analyzing that type. Only until it's used ( and IF it is used ) the check is performed.
Other example, the compiler tells you if you're declaring a variable and not using it ( or if you are supposed to hold a return value and you're not )
The following doesn't compile:
package main
func main() {
var a int
a = 0
}
notused.go:3: a declared and not used
This kinds of enforcements and principles make the resulting code safer, and the compiler doesn't have to perform extra validations that the programmer can do.
At large all these details make a language easier to parse which result in fast compilations.
Again, in my own words.

Go imports dependencies once for all files, so the import time doesn't increase exponentially with project size.
Simpler linguistics means interpreting them takes less computing.
What else?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio