I would like to get a deep understanding of OpenMP and its internal mechanisms, its data structures, and algorithms at an operating system level (for example, I suppose that for task affinity each place has its own task queue and in the case of untied tasks there is a task migration / stealing between run queues). Is there a database of technical papers that describe all this stuff as well as a guide describing what files in the gcc source tree are of interest?
I searched the GCC source tree for filenames containing the string omp and found some results, but I don't know whether these are all the associated files.
I would like to get a deep understanding of OpenMP and its internal mechanisms, its data structures, and algorithms at an operating system level
OpenMP is a standard, not an implementation. There are multiple implementations. Two are mainstream: GOMP, associated with GCC, and IOMP, associated with Clang (and ICC).
Is there a database of technical papers that describe all this stuff as well as a guide describing what files in the GCC source tree are of interest?
AFAIK, not for GOMP. The code is the reference for this, along with the associated documentation (more specifically this page). The code is modified over time, so such a document would quickly become obsolete (especially since new versions of the OpenMP specification are released relatively frequently, sometimes causing changes deep in the target implementation).
Note that there is some generated documentation online, like this one, but it looks really obsolete now.
I searched the GCC source tree for filenames containing the string omp and found some results, but I don't know whether these are all the associated files.
Generally, an OpenMP implementation is written in two parts. One part is in the compiler; it parses the pragmas and then converts them to runtime calls (eg. parallel sections), or tunes the compiler's behaviour (eg. SIMD directives). This is a kind of front-end. The other part is the runtime, which is the heart of the implementation, a kind of back-end, where the dynamic data structures live (eg. for tasks, barriers, parallel sections, etc.). GOMP is implemented that way. That being said, the two parts are more closely interrelated than in other implementations like IOMP (AFAIK, GOMP is not meant to be used from a compiler other than GCC).
The code is available here. "loop.c" is probably the first file to look at to understand the implementation. GOMP is relatively simple overall.
I suppose that for task affinity each place has its own task queue and in the case of untied tasks there is a task migration / stealing between run queues
Task affinity is a new feature that is only supported recently (in GCC 12). I would not be surprised if it were a no-op (this is not rare for new features). In fact, "affinity.c" tends to confirm this. As for the queues, the last time I looked at the code, GOMP was using a central queue (which does not scale).
After looking into the recently announced support for WebAssembly, it occurs to me that its utility would increase dramatically if there were some way to:
Have TurboFan, the successor to V8's Crankshaft JIT optimizer, output all the assembly code it generates, along with the static type signatures and the execution profile of that generated code.
Permit the programmer to provide their own asm.js/WebAssembly code for specific static type signatures, overriding the optimizer.
Is there some way to do this already?
There is some indication that it may be from the following passage from this article:
Under the hood, the WebAssembly implementation in V8 is designed to reuse much of the existing JavaScript virtual machine infrastructure, specifically the TurboFan compiler. A specialized WebAssembly decoder validates modules by checking types, local variable indices, function references, return values, and control flow structure in a single pass. The decoder produces a TurboFan graph which is processed by various optimization passes and finally turned into machine code by the same backend which generates machine code for optimized JavaScript and asm.js. In the next few months, the team will concentrate on improving the startup time of the V8 implementation through compiler tuning, parallelism, and compilation policy improvements.
To expand on the idea for a more general audience:
Typical top-down optimization involves high level programming and then execution profiling to identify which pieces of code require more effort. This is true whether the optimization is automated code generation or manual coding of optimized code. In the case of dynamically typed languages you'll frequently want to go beyond just optimizing dynamically-typed algorithms and provide code specialized for specific static types. This is, in fact, what the V8 JIT optimizer does automatically. If humans want to manually provide some particularly 'hot' specialized cases, they'd need to inform the automated optimizer, somehow, that they have already done the work so the automated optimizer can incorporate the manually optimized code rather than automatically generating suboptimal code.
No, that's not possible, and it's highly unlikely that it ever will be, given that it would probably require piercing all sorts of abstraction barriers within the system. The complexity would be enormous, and the effect on maintainability and security would probably be severe.
The web interface to WebAssembly modules (through the Wasm object) provides a clean and simple way to interface between JS and Wasm. In the future, ES6 modules might simplify interop further. It's not obvious what advantage a complicated mechanism like you propose would have over that.
For 1. you can play with the following flags:
trace_turbo: trace generated TurboFan IR
trace_turbo_graph: trace generated TurboFan graphs
trace_turbo_cfg_file: trace turbo cfg graph (for C1 visualizer) to a given file name
trace_turbo_types: trace TurboFan's types
trace_turbo_scheduler: trace TurboFan's scheduler
trace_turbo_reduction: trace TurboFan's various reducers
trace_turbo_jt: trace TurboFan's jump threading
trace_turbo_ceq: trace TurboFan's control equivalence
turbo_stats: print TurboFan statistics
They may change in future versions of V8 and aren't a stable API.
TurboFan is pretty complicated in that it consumes information from the baseline JIT / the interpreter, and may get to that information after deopt. The compiler isn't always a straight pipeline from JS / wasm to assembly. Inlining and a bunch of other things affect what happens.
For 2.: write wasm code or valid asm.js in the first place.
We've discussed performing a bunch of different types of dynamic tracing, caching traces (and allowing injection of traces for testing), but that's probably not something we'd expose considering that there's already a way to give the compiler precise type information!
I've been looking over a few tutorials for JIT and allocating heaps of executable memory at runtime. This is mainly a conceptual question, so please correct me if I got something wrong.
If I understand it correctly, a JIT takes advantage of a runtime interpreter/compiler that outputs native or executable code and, if it is native binary code, places it in an executable heap in memory, which is allocated in an OS-specific way (e.g. VirtualAlloc() on Windows, mmap() on Linux).
Additionally, some languages like Erlang can form a distributed system in which each part is separated from the others, so that if one fails, the others can compensate in a modular way; modules can also be switched in and out at will, if managed correctly, without disturbing overall execution.
With a runtime compiler or some sort of code delivery mechanism, wouldn't it be feasible to load code at runtime arbitrarily to replace modules of code that could be updated?
Example
Say I have a sort(T, T) function that operates on T or T. Now, suppose I have a merge_sort(T,T) function that I have loaded at runtime. If I implement a sort of ledger or register system such that users of the first sort(T,T) can reassign themselves to use the new merge_sort(T,T) function and detect when all users have adjusted themselves, I should then be able to deallocate and delete sort(T,T) from memory.
This basically sounds a lot like a JIT, but the attractive part, to me, was the aspect where you can swap out code arbitrarily at runtime for modules. That way, while a system is not under full load such that each module is being used, modules could automatically switch to new code as needed. Theoretically, wouldn't this be a way to implement patches such that a user of a program should never have to "restart" the program, if the program can swap out code silently in the individual modules? I'd imagine much larger distributed systems can make use of this, but what about smaller ones?
Additionally, some languages like Erlang can form a distributed system in which each part is separated from the others, so that if one fails, the others can compensate in a modular way; modules can also be switched in and out at will, if managed correctly, without disturbing overall execution.
You're describing how to make a fault-tolerant system, which is entirely different from replacing code at run-time (known as Dynamic Software Update, or DSU). Indeed, in Erlang, you can have one process monitoring other processes, and if one fails, it migrates the work to another process to keep the system running as expected. Note that DSU is not used to implement fault-tolerance. They are different features with different purposes.
Say I have a sort(T, T) function that operates on T or T. Now, suppose I have a merge_sort(T,T) function that I have loaded at runtime. If I implement a sort of ledger or register system such that users of the first sort(T,T) can reassign themselves to use the new merge_sort(T,T) function and detect when all users have adjusted themselves, I should then be able to deallocate and delete sort(T,T) from memory.
This is called DSU, and it is used to perform any of the following tasks without taking the system down:
Fix one or more bugs in a piece of code.
Patch security holes.
Employ more efficient code.
Deploy new features.
Therefore, any app or system can use DSU so that it can perform these tasks without requiring a restart.
Erlang enables you to perform DSU in addition to facilitating fault-tolerance as discussed above. For more information, refer to this Erlang white paper.
There are numerous ways to implement DSU. Since you're interested in JIT compilers and assuming that by "JIT compiler" you mean the component that not only compiles IL code but also allocates executable memory and patches function calls with binary code addresses, I'll discuss how to implement DSU in JIT environments. The JIT compiler has to support the following two features:
The ability to obtain or create new binary code at run-time. If you have IL code, no need to allocate executable memory yet since it has to be compiled.
The ability to replace a piece of IL code (which might have already been JITted) or binary code with the new piece of code.
Clearly, with these two features, you can perform DSU on a single function. Swapping a whole module or library requires swapping all the functions and global variables exported by that module.
I'm working on converting an existing program to take advantage of some parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outline two specific steps.
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to be disappointed in the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang; it is not STL by any means, but it is cross-platform.
A third alternative is to use Grand Central Dispatch, the official multicore API on OS X; again, hardly STL.
A fourth alternative is to wait for C++17, which will have a parallelism module.
I've noticed that some static analyzers operate on source code, while others operate on bytecode (e.g., FindBugs). I'm sure there are even some that work on object code.
My question is a simple one, what are the advantages and disadvantages of writing different kinds of static analyzers for different levels of analysis?
Under "static analyzers" I'm including linters, bug finders, and even full-blown verifiers.
And by levels of analysis I would include source code, high-level IRs, low-level IRs, bytecode, object code, and compiler plugins that have access to all phases.
These different facets can influence the level at which an analyzer may decide to work:
Designing a static analyzer is a lot of work. It would be a shame not to share this work across several languages compiled to the same bytecode, especially when the bytecode retains most of the structure of the source program: Java (FindBugs), .NET (various tools related to Code Contracts). In some cases, the common target language was made up for the purpose of analysis, even though the normal compilation scheme didn't go through it.
Related to 1, you may hope that your static analyzer will be a little less costly to write if it works on a normalized version of the program with a minimum number of constructs. When authoring static analyzers, having to handle repeat … until when you have already handled while … do is a bother. You may structure your analyzer so that several functions are shared between these two cases, but the care-free way to handle this is to translate one to the other, or to translate the source to an intermediate language that only has one of them.
On the other hand, as already pointed out in Flash Sheridan's answer, source code contains the most information. For instance, in languages with fuzzy semantics, bugs at the source level may be removed by compilation. C and C++ have numerous "undefined behaviors" where the compiler is allowed to do anything, including generating a program that works accidentally. Fine, you might think: if the bug is not in the executable, it's not a problematic bug. But if you ever re-compile the program for another architecture or with the next version of the compiler, the bug may appear again. This is one reason for not doing the analysis after any phase that might potentially remove bugs.
Some properties can only be checked with reasonable precision on compiled code. That includes absence of compiler-introduced bugs as pointed out again by Flash Sheridan, but also worst-case execution time. Similarly, many languages do not let you know what floating-point code does precisely unless you look at the assembly generated by the compiler (this is because existing hardware does not make it convenient for them to guarantee more). The choice is then to write an imprecise source-level analyzer that takes into account all possibilities, or to analyze precisely one particular compilation of a floating-point program, as long as it is understood that it is that precise assembly code that will be executed.
Source code analysis is the most generally useful, of course; sometimes heuristics even need to analyze comments or formatting. But you’re right that even object code analysis can be necessary, e.g., to detect bugs introduced by GCC misfeatures. Thomas Reps, head of GrammaTech and a Wisconsin professor, gave a good talk on this at Stanford a couple of years ago: http://pages.cs.wisc.edu/~reps/#TOPLAS-WYSINWYX.
Does anyone have any suggestions for assembly file analysis tools? I'm attempting to analyze ARM/Thumb-2 ASM files generated by LLVM (or alternatively GCC) when passed the -S option. I'm particularly interested in instruction statistics at the basic block level, e.g. memory operation counts, etc. I may wind up rolling my own tool in Python, but was curious to see if there were any existing tools before I started.
Update: I've done a little searching, and found a good resource for disassembly tools / hex editors / etc here, but unfortunately it is mainly focused on x86 assembly, and also doesn't include any actual assembly file analyzers.
What you need is a tool for which you can define an assembly language syntax, and then build custom analyzers. Your analyzers might be simple ("How much space does an instruction take?") or complex ("How many cycles will this instruction take to execute?" [which depends on the preceding sequence of instructions and possibly a sophisticated model of the processor you care about]).
One designed specifically to do that is the New Jersey Machine Toolkit. It is really designed to build code generators and debuggers. I suspect it would be good at "instruction byte count". It isn't clear it is good at more sophisticated analyses. And I believe it insists you follow its syntax style, rather than yours.
One not designed specifically to do that, but good at parsing/analyzing languages in general, is our DMS Software Reengineering Toolkit.
DMS can be given a grammar description for virtually any context-free language (that covers most assembly language syntax) and can then parse a specific instance of that grammar (assembly code) into ASTs for further processing. We've done this with several assembly languages, including the IBM 370, Motorola's 8-bit CPU line, and a rather peculiar DSP, without trouble.
You can easily specify an attribute grammar (a computation over an AST) to DMS. These are a great way to encode analyses that need just local information, such as "How big is this instruction?". For more complex analyses, you'll need a processor model that is driven by a series of instructions; passing the ASTs for individual instructions to such a machine model would be an easy way to compute more complex things, such as "How long does this instruction take?".
Other analyses, such as control flow and data flow, are provided in generic form by DMS. You can use an attribute evaluator to collect local facts ("the control-next for this instruction is...", "data from this instruction flows to...") and feed them to the flow analyzers to compute global flow facts ("if I execute this instruction, what other instructions might be executed downstream?").
You do have to configure DMS for your particular (assembly) language. It is designed to be configured for tasks like these.
Yes, you could likely code all this in Python; after all, it's Turing-complete. But likely not nearly as easily.
An additional benefit: DMS is willing to apply transformations to your code, based on your analyses. So you could implement your optimizer with it, too. After all, you need to connect the analysis indicating that the optimization is safe to the actual optimization steps.
I have written many disassemblers, including for ARM and Thumb. Not production quality, but written for the purpose of learning the assembler. For both ARM and Thumb, the ARM ARM (ARM Architectural Reference Manual) has a nice chart from which you can easily count up data operations, loads/stores, etc. Maybe an hour's worth of work, maybe two. At least up front; you would end up with data values being counted as well, though.
The other poster may be right: with the chart I mentioned, it should be very simple to write a program that scans the ASCII looking for ldr, str, add, etc. There is no need to parse everything if you are only interested in memory operation counts and the like. Of course, the downside is that you are likely not going to be able to account for loops. One function may have a load and a store; another may have a load and a store wrapped in a loop, causing many more memory operations once executed.
Not knowing what you are really interested in, my guess is you might want to simulate the code and count these sorts of things. I wrote a thumb simulator (thumbulator) that attempts to do just that. (And I have used it to compare LLVM execution vs GCC execution in terms of number of instructions executed, fetches, memory operations, etc.) The problem may be that it is Thumb only: no ARM, no Thumb2. Thumb2 could be added more easily than ARM. There exists an armulator from ARM, which is in the gdb sources among other places. I can't remember now whether it executes Thumb2. My understanding is that back when ARM was using it, it would accurately report these sorts of statistics.
You can plug your statistics into the LLVM code generator; it's quite flexible, and it already collects some stats, which could be used as an example.