Utilizing CLMUL in SICStus Prolog

Utilizing CLMUL in SICStus Prolog - prolog

How can I get the SICStus Prolog JIT to use any of the following ISA?
Intel BMI:
POPCNT, LZCNT, TZCNT, PDEP, PEXT
Intel CLMUL: PCLMULQDQ
ARM AArch64: RBIT
I need them for supercharging clpz. Right now, I got:
http://www.hackersdelight.org/ and
the non-ISO arithmetic function msb/1.
For a start that's good, but I want more. Please help!

Unfortunately, there is no way for users to extend the JIT for cases like this.
I have been thinking about accessing the population count instructions (for some unrelated uses) from Prolog. The way to add this and other similar instructions would be:
Add a new arithmetic instruction to is/2. This needs to be supported by all our code, not just JIT-compiled code, so interpreter, WAM-emulator, various internal byte-code-processors, all the static analyzers in our IDE, etc. etc.
Add JIT-compilation that just calls back into the corresponding C routine in the runtime system.
If it can be demonstrated to benefit performance sufficiently, make the JIT compiler emit the special purpose CPU instructions for targets that have them.
(1) requires sufficient user demand (or explicit financing, of course). (3) requires convincing benchmarks. Currently neither of these are available, but that could change, of course.

Related

Do compilers usually emit vector (SIMD) instructions when not explicitly told to do so?

C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
I noticed that Microsoft's experimental implementation mentions that the VC++ compiler lacks support to do vectorization over here, which surprises me - I thought that modern C++ compilers are able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even if explicitly told to do so. The seeming lack of automatic vectorization support contradicts the answers for this 2011 question on Quora, which suggests that compilers will do vectorization where possible.
Maybe, compilers will only vectorize very obvious cases such as a std::array<int, 4>, and no more than that, thus C++17's explicit parallelization would be useful.
Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: Do compilers for other languages do better automatic vectorization (maybe due to language design) (so that the C++ standards committee decides it necessary for explicit (C++17-style) vectorization)?

The best compiler for automatically spotting SIMD style vectorisation (when told it can generate opcodes for the appropriate instruction sets of course) is the Intel compiler in my experience (which can generate code to do dynamic dispatch depending on the actual CPU if required), closely followed by GCC and Clang, and MSVC last (of your four).
This is perhaps unsurprising I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also very rightly point out using their compiler also allows you to use pragma simd constructs to further show the compiler assumptions that can or can't be made (that are unclear from a purely syntactic level), and hence allow the compiler to further vectorise the code without resorting to intrinsics.
This, I think, points at the issue with hoping that the compiler (for C++ or another language) will do all the vectorisation work... if you have simple vector processing loops (eg multiply all the elements in a vector by a scalar) then yes, you could expect that 3 of the 4 compilers would spot that.
But for more complicated code, the vectorisation gains that can be had come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that's going to hard if not impossible for a compiler to do completely alone. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to allow the compiler to see the opportunities do so, perhaps with pragma simd constructs or OpenMP, then you may get the results you want.
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this - put your c++ code in there and look at what different compilers actually generate? Very handy... it doesn't include older version of MSVC (I think it currently supports VC++ 2017 and later versions) but will show you what different versions of ICC, GCC, Clang and others can do with code...

Default GCC optimization options for a specific architecture

our compilers course features exercises asking us to compare code built with the -O and -O3 gcc options. The code generated by my machine isn't the same as the code in the course. Is there a way to figure the optimization options used in the course, in order to obtain the same code on my machine, and make more meaningful observations?
I found how to get the optimization options on my machine :
$ gcc -O3 -Q --help=optimizer
But is there a way to deduce those on the machine of the professor except by trying them all and modifying them one by one (.ident "GCC: (Debian 4.3.2-1.1) 4.3.2")?
Thanks for your attention.
Edit:
I noticed that the code generated on my machine lacks the prologue and epilogue generated on my professor's. Is there an option to force prologue generation (google doesn't seem to bring much)?

Here's what you need to know about compiler optimizations : they are architecture dependent. Also, they're mainly different from one version of the compiler to another (gcc-4.9 does more stuff by default than gcc-4.4).
By architecture, I mean CPU micro architecture (Intel : Nehalem, Sandy bridge, Ivy Bridge, Haswell, KNC ... AMD : Bobcat, Bulldozzer, Jaguar, ...). Compilers usually convert input code (C, C++, ADA, ...) into a CPU-agnostic intermediary representation (GIMPLE for GCC) on which a large number of optimizations will be performed. After that, the compiler will generate a lower level representation closer to assembly. On the latter, architecture specific optimizations will be unrolled. Such optimizations include the choice of instructions with the lowest latencies, determining loop unroll factors depending on the loop size, the instruction cache size, and so on.
Since your generated code is different from the one you got in class, I suppose the underlying architectures must be different. In this case, even with the same compiler flags you won't be able to get the same assembly code (even with no optimizations you'll get different assembly codes).
For that, you should concentrate on comparing the optimized and non-optimized codes rather than trying to stick to what you were given in class. I even think that it's a great reverse engineering exercise to compare your optimized code to the one you were given.
You can find one of my earlier posts about compiler optimizations in here.
Two great books on the subject are The Dragon Book (Compilers: Principles, Techniques, and Tools) by Aho, Seti, and Ulman, and also Engineering a Compiler by Keith Cooper, and Linda Torczon.

Assembly Analysis Tools

Does anyone have any suggestions for assembly file analysis tools? I'm attempting to analyze ARM/Thumb-2 ASM files generated by LLVM (or alternatively GCC) when passed the -S option. I'm particularly interested in instruction statistics at the basic block level, e.g. memory operation counts, etc. I may wind up rolling my own tool in Python, but was curious to see if there were any existing tools before I started.
Update: I've done a little searching, and found a good resource for disassembly tools / hex editors / etc here, but unfortunately it is mainly focused on x86 assembly, and also doesn't include any actual assembly file analyzers.

What you need is a tool for which you can define an assembly language syntax, and then build custom analyzers. You analyzers might be simple ("how much space does an instruction take?") or complex ("How many cycles will this isntruction take to execute?" [which depends on the preceding sequence of instructions and possibly a sophisticated model of the processor you care about]).
One designed specifically to do that is the New Jersey Machine Toolkit. It is really designed to build code generators and debuggers. I suspect it would be good at "instruction byte count". It isn't clear it is good at more sophisticated analyses. And I believe it insists you follow its syntax style, rather than yours.
One not designed specifically to do that, but good at parsing/analyzing langauges in general is our
DMS Software Reengineering Toolkit.
DMS can be given a grammar description for virtually any context free language (that covers most assembly language syntax) and can then parse a specific instance of that grammar (assembly code) into ASTs for further processing. We've done with with several assembly langauges, including the IBM 370, Motorola's 8 bit CPU line, and a rather peculiar DSP, without trouble.
You can specify an attribute grammar (computation over an AST) to DMS easily. These are great way to encode analyses that need just local information, such as "How big is this instruction?". For more complex analysese, you'll need a processor model that is driven from a series of instructions; passing such a machine model the ASTs for individual instructions would be an easy way to apply a machine model to compute more complex things as "How long does this instruction take?".
Other analyses such as control flow and data flow, are provided in generic form by DMS. You can use an attribute evaluator to collect local facts ("control-next for this instruction is...", "data from this instruction flows to,...") and feed them to the flow analyzers to compute global flow facts ("if I execute this instruction, what other instructions might be executed downstream?"..)
You do have to configure DMS for your particular (assembly) language. It is designed to be configured for tasks like these.
Yes, you can likely code all this in Python; after all, its a Turing machine. But likely not nearly as easily.
An additional benefit: DMS is willing to apply transformations to your code, based on your analyses. So you could implement your optimizer with it, too. After all, you need to connect the analysis indication the optimization is safe, to the actual optimization steps.

I have written many disassemblers, including arm and thumb. Not production quality but for the purposes of learning the assembler. For both the ARM and Thumb the ARM ARM (ARM Architectural Reference Manual) has a nice chart from which you can easily count up data operations from load/store, etc. maybe an hours worth of work, maybe two. At least up front, you would end up with data values being counted though.
The other poster may be right, as with the chart I am talking about it should be very simple to write a program to examine the ASCII looking for ldr, str, add, etc. No need to parse everything if you are interested in memory operations counts, etc. Of course the downside is that you are likely not going to be able to examine loops. One function may have a load and store, another may have a load and store but have it wrapped by a loop, causing many more memory operations once executed.
Not knowing what you really are interested in, my guess is you might want to simulate the code and count these sorts of things. I wrote a thumb simulator (thumbulator) that attempts to do just that. (and I have used it to compare llvm execution vs gcc execution when it comes to number of instructions executed, fetches, memory operations, etc) The problem may be that it is thumb only, no ARM no Thumb2. Thumb2 could be added easier than ARM. There exists an armulator from arm, which is in the gdb sources among other places. I cant remember now if it executes thumb2. My understanding is that when arm was using it would accurately tell you these sorts of statistics.

You can plug your statistics into LLVM code generator, it's quite flexible and it is already collecting some stats, which could be used as an example.

GCC: Inline assembly - good for?

So I just found out GCC could do inline assembly and I was wondering two things:
What's the benefit of being able to inline assembly?
Is it possible to use GCC as an assembly compiler/assembler to learn assembly?
I've found a couple articles but they are all oldish, 2000 and 2001, not really sure of their relevance.
Thanks

The benefit of inline assembly is to have the assembly code, inlined (wait wait, don't kill me). By doing this, you don't have to worry about calling conventions, and you have much more control of the final object file (meaning you can decide where each variable goes- to which register or if it's memory stored), because that code won't be optimized (assuming you use the volatile keyword).
Regarding your second question, yes, it's possible. What you can do is write simple C programs, and then translate them to assembly, using
gcc -S source.c
With this, and the architecture manuals (MIPS, Intel, etc) as well as the GCC manual, you can go a long way.
There's some material online.
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/
The downside of inline assembly, is that usually your code will not be portable between different compilers.
Hope it helps.

Inline Assembly is useful for in-place optimizations, and access to CPU features not exposed by any libraries or the operating system.
For example, some applications need strict tracking of timing. On x86 systems, the RDTSC assembly command can be used to read the internal CPU timer.
Time Stamp Counter - Wikipedia
Using GCC or any C/C++ compiler with inline assembly is useful for small snippets of code, but many environments do not have good debugging support- which will be more important when developing projects where inline assembly provides specific functionality. Also, portability will become a recurring issue if you use inline assembly. It is preferable to create specific items in a suitable environment (GNU assembler, MASM) and import them projects as needed.

Inline assembly is generally used to access hardware features not otherwise exposed by the compiler (e.g. vector SIMD instructions where no intrinsics are provided), and/or for hand-optimizing performance critical sections of code where the compiler generates suboptimal code.
Certainly there is nothing to stop you using the inline assembler to test routines you have written in assembly language; however, if you intend to write large sections of code you are better off using a real assembler to avoid getting bogged down with irrelevancies. You will likely find the GNU assembler got installed along with the rest of the toolchain ;)

The benefit of embedding custom assembly code is that sometimes (dare I say, often times) a developer can write more efficient assembly code than a compiler can. So for extremely performance intensive items, custom written assembly might be beneficial. Games tend to come to mind....
As far as using it to learn assembly, I have no doubt that you could. But, I imagine that using an actual assembly SDK might be a better choice. Aside from the standard experimentation of learning how to use the language, you'd probably want the knowledge around setting up a development environment.

You should not learn assembly language by using the inline asm feature.
Regarding what it's good for, I agree with jldupont, mostly obfuscation. In theory, it allows you to easily integrate with the compiler, because the complex syntax of extended asm allows you to cooperate with the compiler on register usage, and it allows you to tell the compiler that you want this and that to be loaded from memory and placed in registers for you, and finally, it allows the compiler to be warned that you have clobbered this register or that one.
However, all of that could have been done by simply writing standard-conforming C code and then writing an assembler module, and calling the extension as a normal function. Perhaps ages ago the procedure call machine op was too slow to tolerate, but you won't notice today.
I believe the real answer is that it is easier, once you know the contraint DSL. People just throw in an asm and obfuscate the C program rather than go to the trouble of modifying the Makefile and adding a new module to the build and deploy workflow.

This isn't really an answer, but kind of an extended comment on other peoples' answers.
Inline assembly is still used to access CPU features. For instance, in the ARM chips used in cell phones, different manufacturers distinguish their offerings via special features that require unusual machine language instructions that would have no equivalent in C/C++.
Back in the 80s and early 90s, I used inline assembly a lot for optimizing loops. For instance, C compilers targeting 680x0 processors back then would do really stupid things, like:
calculate a value and put it in data register D1
PUSH D1, A7 # Put the value from D1 onto the stack in RAM
POP D1, A7 # Pop it back off again
do something else with the value in D1
But I haven't needed to do that in, oh, probably fifteen years, because modern compilers are much smarter. In fact, current compilers will sometimes generate more efficient code than most humans would. Especially given CPUs with long pipelines, branch prediction, and so on, the fastest-executing sequence of instructions is not always the one that would make most sense to a human. So you can say, "Do A B C D in that order", and the compiler will scramble the order all around for greater efficiency.
Playing a little with inline assembly is fine for starters, but if you're serious, I echo those who suggest you move to a "real" assembler after a while.

Manual optimization of loops that are executed a lot. This article is old, but can give you an idea about the kinds of optimizations hand-coded assembly is used for.
You can also use the assembler gcc uses directly. It's called as (see man as). However, many books and articles on assembly assume you are using a DOS or Windows environment. So it might be kind of hard to learn on Linux (maybe running FreeDOS on a virtual machine), because you not only need to know the processor (you can usually download the official manuals) you code for but also how hook to into the OS you are running.
A nice beginner book using DOS is the one by Norton and Socha. It's pretty old (the 3rd and latest edition is from 1992), so you can get used copies for like $0.01 (no joke). The only book I know of that is specific to Linux is the free "Programming from the Ground Up"

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (eg HADDPD - (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either deliberately set up, or occur from the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly to be used in hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?

Generally, few compilers use them. GCC and Visual Studio arn't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, it will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but didn't work last I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general though, you'll have to use them yourself. Either in raw assembler, or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they better allow the compiler to understand the code, and so schedule and optimize, but in practice, I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment, see what works. But don't expect the compiler to use these instructions for you, unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.

Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html

The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler

I have seen gcc use sse to zero out a default std::string object. Not a particularly powerful use of sse, but it exists. In most cases, though you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!

If you use the vector pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage. Basically this is anything of length less than 64 bits. ( for 64 bit reals it is actually slower to do SIMD).
Latest versions of the compiler will also automatically parallelise accross cores

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio