What do we need to define when using the parallel optimization flag? - parallel-processing

I have a program with more than 100 subroutines, and I am trying to make this code run faster by compiling the subroutines with the parallel flag. I was wondering what variables or parameters I need to define in the program if I want to use the parallel flag. Just using the parallel optimization flag actually increased the run time of my program compared to the build without the parallel flag.
Any suggestions are highly appreciated. Thanks a lot.
Best Regards,
Jdbaba

I can give you some general guidelines, but without knowing your specific compiler and platform/OS I won't be able to help you specifically. As far as I know, all of the autoparallelization schemes used in Fortran compilers rely on either OpenMP or MPI underneath to split loops out into threads or processes, and there is a certain amount of overhead associated with those schemes. For instance, in one case I had a program that used an optimization library supplied by a vendor in compiled form, so no optimization could be performed inside it. Because all of my subroutines and functions were called either outside or inside the optimizer's main loop, and the library was object code only, the autoparallelizer could not perform interprocedural optimization (IPO) and so never used more than one core. In that case, because of the OpenMP runtime DLL that had to be loaded, /Qparallel actually added roughly 10% to the run time.
As a note, autoparallelizers aren't magic. Essentially they perform the same kind of analysis as autovectorization: they look for loops whose iterations do not depend on data from previous iterations. If variables are carried between iterations, or the compiler cannot prove that they are not, it will not attempt to parallelize the loop.
If you are using the Intel Fortran compiler, you can turn on a diagnostic switch, "/Qpar-report3" (Windows) or "-par-report3" (Linux), to report loop dependencies and explain why loops failed to parallelize. If the major loops live in code you cannot recompile (for example, vendor libraries), there is a good chance there won't be much opportunity in your code for the auto-parallelizer.
In any case, you can always attempt to reduce dependencies and reformulate your code such that it is more friendly to autoparallelization.
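Whether a loop can be auto-parallelized comes down to loop-carried dependences. As a language-agnostic illustration (sketched here in C++ rather than Fortran purely for brevity), the first loop below has no dependence between iterations and is a candidate for parallelization, while the second depends on the previous iteration and will be rejected:

    #include <vector>

    void scale(std::vector<double>& a, double k) {
        // Each iteration touches only a[i]: no loop-carried dependence,
        // so an auto-parallelizer (or OpenMP) can split the iterations freely.
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] *= k;
    }

    double prefix_sum(std::vector<double>& a) {
        // a[i] depends on a[i - 1] from the previous iteration: a classic
        // loop-carried dependence, so this loop will not be parallelized as-is.
        for (std::size_t i = 1; i < a.size(); ++i)
            a[i] += a[i - 1];
        return a.back();
    }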

Related

Halide AOT compilation

I'm trying to use Halide for a stencil computation that's part of a C++ code. I use buffers that point to my data structures that were defined in C++. I want to time the execution of my code, so I'm trying to use AOT compilation so that I can time just the execution and not the compile time.
So is there any way to use Buffers when declaring a vector of type Argument for the inputs when using AOT compilation?
If not, is there any other way to just time the execution when using Halide?
You can use Buffers in AOT compiled code. If you put them in the Argument vector, they become arguments to the generated function. If you leave them out, they get embedded in the generated object file as constant arrays. Here are the tutorials on AOT compilation:
http://halide-lang.org/tutorials/tutorial_lesson_10_aot_compilation_generate.html
http://halide-lang.org/tutorials/tutorial_lesson_10_aot_compilation_run.html
The "compile_jit" method can be called on a Pipeline to cause compilation to happen. The same Func can then be called via the "realize" or "evaluate" methods and timed. This separates the execution time from the compile time while still using JIT. AOT is likely preferable if you are going to use that in your application anyway, but the compile_jit technique is useful if working with JIT for tests or simple benchmarks.
(There are a few other issues with separating out host/GPU buffer transfers from timing loops while also ensuring the GPU execution is not asynchronous with respect to the timing loops. Covering this is a bit more involved and I don't think it is relevant here.)

Is it possible to have a hotfix at runtime with executable memory heaps and a distributed system?

I've been looking over a few tutorials for JIT and allocating heaps of executable memory at runtime. This is mainly a conceptual question, so please correct me if I got something wrong.
If I understand it correctly, a JIT takes advantage of a runtime interpreter/compiler that outputs native or executable code and, if native binary, places it in an executable code heap in memory, which is OS-specific (e.g. VirtualAlloc() for Windows, mmap() for Linux).
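For concreteness, a minimal sketch of that mechanism on Linux/x86-64 (the six bytes are simply mov eax, 42; ret, standing in for whatever a JIT would emit; error handling omitted):

    #include <sys/mman.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        // mov eax, 42 ; ret  -- the "output" of an imaginary JIT back end
        unsigned char code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};

        // Allocate a writable page, copy the code in, then flip it to executable.
        void* mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        std::memcpy(mem, code, sizeof(code));
        mprotect(mem, 4096, PROT_READ | PROT_EXEC);

        auto fn = reinterpret_cast<int (*)()>(mem);
        std::printf("%d\n", fn());   // prints 42

        munmap(mem, 4096);
    }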
Additionally, some languages like Erlang can have a distributed system such that each part is separated from each other, meaning that if one fails, the others can account for such a case in a modular way, meaning that modules can also be switched in and out at will if managed correctly without disturbing overall execution.
With a runtime compiler or some sort of code delivery mechanism, wouldn't it be feasible to load code at runtime arbitrarily to replace modules of code that could be updated?
Example
Say I have a sort(T, T) function that operates on T or T. Now, suppose I have a merge_sort(T,T) function that I have loaded at runtime. If I implement a sort of ledger or register system such that users of the first sort(T,T) can reassign themselves to use the new merge_sort(T,T) function and detect when all users have adjusted themselves, I should then be able to deallocate and delete sort(T,T) from memory.
This basically sounds a lot like a JIT, but the attractive part, to me, was being able to swap out code for individual modules arbitrarily at runtime. That way, while a system is not under full load, modules could automatically switch over to new code as needed. Theoretically, wouldn't this be a way to deliver patches such that a user never has to "restart" the program, because the program can silently swap out code in the individual modules? I'd imagine much larger distributed systems can make use of this, but what about smaller ones?
Additionally, some languages like Erlang can have a distributed system such that each part is separated from each other, meaning that if one fails, the others can account for such a case in a modular way, meaning that modules can also be switched in and out at will if managed correctly without disturbing overall execution.
You're describing how to make a fault-tolerant system, which is entirely different from replacing code at run time (known as Dynamic Software Update, or DSU). Indeed, in Erlang you can have one process monitoring other processes, and if one fails, it will migrate the work to another process to keep the system running as expected. Note that DSU is not used to implement fault tolerance; they are different features with different purposes.
Say I have a sort(T, T) function that operates on T or T. Now, suppose I have a merge_sort(T,T) function that I have loaded at runtime. If I implement a sort of ledger or register system such that users of the first sort(T,T) can reassign themselves to use the new merge_sort(T,T) function and detect when all users have adjusted themselves, I should then be able to deallocate and delete sort(T,T) from memory.
This is called DSU, and it is used to do any of the following without taking the system down:
Fix one or more bugs in a piece of code.
Patch security holes.
Switch to more efficient code.
Deploy new features.
Therefore, any app or system can use DSU so that it can perform these tasks without requiring a restart.
Erlang enables you to perform DSU in addition to facilitating fault tolerance, as discussed above. For more information, refer to this Erlang white paper.
There are numerous ways to implement DSU. Since you're interested in JIT compilers, and assuming that by "JIT compiler" you mean the component that not only compiles IL code but also allocates executable memory and patches call sites with binary code addresses, I'll discuss how to implement DSU in JIT environments. The JIT compiler has to support the following two features:
The ability to obtain or create new binary code at run time. If you only have IL code, there is no need to allocate executable memory yet, since it still has to be compiled.
The ability to replace a piece of IL code (which might have already been JITted) or binary code with the new piece of code.
Clearly, with these two features, you can perform DSU on a single function. Swapping a whole module or library requires swapping all the functions and global variables exported by that module.
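As a sketch of the replacement half of this (the names and single-function granularity are just illustrative, and the "ledger" from the question is modelled with reference counting), a call site can go through an atomically swappable pointer, so the old implementation is freed only after the last in-flight user releases it:

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <vector>

    using SortFn = std::function<void(std::vector<int>&)>;

    // The dispatch slot every caller goes through. Holding a shared_ptr while
    // calling acts as the "ledger": the old code object stays alive until the
    // last caller that grabbed it has finished.
    std::shared_ptr<SortFn> g_sort = std::make_shared<SortFn>(
        [](std::vector<int>& v) { std::sort(v.begin(), v.end()); });

    void hot_swap(SortFn replacement) {
        // Atomically publish the new implementation; in-flight callers keep
        // using the old one until they drop their reference.
        std::atomic_store(&g_sort, std::make_shared<SortFn>(std::move(replacement)));
    }

    void use_sort(std::vector<int>& v) {
        auto fn = std::atomic_load(&g_sort);   // grab the current implementation
        (*fn)(v);
    }

    int main() {
        std::vector<int> data{3, 1, 2};
        use_sort(data);                          // original sort(T, T)
        hot_swap([](std::vector<int>& v) {       // stand-in for merge_sort(T, T)
            std::stable_sort(v.begin(), v.end());
        });
        use_sort(data);                          // now runs the new code
        for (int x : data) std::cout << x << ' ';
        std::cout << '\n';
    }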

Measuring execution time in the D language

I'm new to the D language and need to measure the execution time of an algorithm. What are my options? Is there already some built-in solution? I could not find anything conclusive on the web.
One way is to use the -profile command-line switch. After you run the program, it will create a file, trace.log, where you can find the run time of each function. This will of course slow your program down, as the compiler inserts time-counting code into each of your functions. The method is used to find the relative speed of functions and identify which ones you should optimize to improve the app's speed with minimum effort.
A second option is to use std.datetime.StopWatch. See the example in the link.
Or, even better suited, you can directly use the std.datetime.benchmark function.
Don't forget:
When benchmarking, use these DMD compiler flags to achieve maximum optimization: -release -O -inline -noboundscheck.
Never benchmark debug builds.
Make sure you don't call any library code inside benchmarked functions - you would be benchmarking the performance of the library implementation instead of your own code.
Additionally, you may consider using the LDC or GDC compilers. Both of them provide better optimization and faster generated code than DMD.
If your algorithm can be called from the command line, there's a nifty utility written in D that will run your program a number of times and print out the distribution of the time taken, the average, and all sorts of other useful numbers.
It's called avgtime and it's here: https://github.com/jmcabo/avgtime
std.benchmark is in the review queue.

Parallel STL algorithms in OS X

I'm working on converting an existing program to take advantage of some parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outlines two specific steps:
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are generally going to be disappointed in the results. Those algorithms only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang; it is not STL by any means, but it is cross-platform (a short sketch follows below).
A third alternative is to use Grand Central Dispatch - the official multicore API on OS X - again hardly STL.
A fourth alternative is to wait for C++17, which will have a parallelism module (parallel versions of the standard algorithms).
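As a sketch of that OpenMP alternative (an OpenMP reduction standing in for a parallel std::accumulate; build with -fopenmp, e.g. using Homebrew GCC or a Clang build with OpenMP support):

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> v(50000000, 1.0);

        double sum = 0.0;
        // Each thread accumulates a private partial sum; OpenMP combines the
        // partial sums at the end of the loop, which is exactly what a
        // parallel accumulate would do.
        #pragma omp parallel for reduction(+ : sum)
        for (long i = 0; i < static_cast<long>(v.size()); ++i)
            sum += v[i];

        std::printf("sum = %.1f\n", sum);
    }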

How to minimize a programming language's compile time?

I was thinking more about the programming language I am designing, and I was wondering: what are ways I could minimize its compile time?
Your main problem today is I/O. Your CPU is many times faster than main memory and memory is about 1000 times faster than accessing the hard disk.
So unless you do extensive optimizations to the source code, the CPU will spend most of the time waiting for data to be read or written.
Try these rules:
Design your compiler to work in several independent steps. The goal is to be able to run each step in a different thread so you can utilize multi-core CPUs. It also helps to parallelize the whole compile process, i.e. compile more than one file at the same time (see the sketch after this list).
It will also allow you to load many source files in advance and preprocess them so the actual compile step can work faster.
Try to allow files to be compiled independently. For example, create a "missing symbol pool" for the project. A missing symbol should not cause a compile failure as such; when the symbol is eventually found in some other file, remove it from the pool. When all files have been compiled, check that the pool is empty.
Create a cache with important information. For example: File X uses symbols from file Y. This way, you can skip compiling file Z (which doesn't reference anything in Y) when Y changes. If you want to go one step further, put all symbols which are defined anywhere in a pool. If a file changes in such a way that symbols are added/removed, you will know immediately which files are affected (without even opening them).
Compile in the background. Start a compiler process which checks the project directory for changes and compiles files as soon as the user saves them. This way, you will only have to compile a few files each time instead of everything. In the long run you will compile more in total, but for the user the turnaround time (the time she has to wait until she can run the compiled result after a change) will be much shorter.
Use a "just in time" compiler, i.e. compile a file when it is used, for example in an import statement. Projects are then distributed in source form and compiled when run for the first time. Python does this. To make this perform well, you can precompile the library during the installation of your compiler.
Don't use header files. Keep all information in a single place and generate header files from the source if you have to. Maybe keep the header files just in memory and never save them to disk.
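A sketch of the "compile more than one file at the same time" rule from the first point above (compile_file is a hypothetical stand-in for your real front end; the driver just farms files out to asynchronous tasks):

    #include <cstdio>
    #include <future>
    #include <string>
    #include <vector>

    // Hypothetical placeholder for the real front end: lex, parse and emit
    // code for a single file, sharing no mutable state with other tasks.
    bool compile_file(const std::string& path) {
        std::printf("compiling %s\n", path.c_str());
        return true;
    }

    int main() {
        std::vector<std::string> files = {"a.src", "b.src", "c.src", "d.src"};

        // Launch one asynchronous task per file so independent files are
        // compiled in parallel on a multi-core machine.
        std::vector<std::future<bool>> jobs;
        for (const auto& f : files)
            jobs.push_back(std::async(std::launch::async, compile_file, f));

        bool ok = true;
        for (auto& j : jobs)
            ok = j.get() && ok;   // wait for every compilation to finish
        return ok ? 0 : 1;
    }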
what are ways I could minimize its compile time?
No compilation (interpreted language)
Delayed (just in time) compilation
Incremental compilation
Precompiled header files
I've implemented a compiler myself, and ended up having to look at this once people started batch-feeding it hundreds of source files. I was quite surprised by what I found out.
It turns out that the most important thing you can optimize is not your grammar. It's not your lexical analyzer or your parser either. Instead, the most important thing in terms of speed is the code that reads your source files from disk. Disk I/O is slow. Really slow. You can pretty much measure your compiler's speed by the number of disk I/Os it performs.
So it turns out that the absolute best thing you can do to speed up a compiler is to read the entire file into memory in one big I/O, do all your lexing, parsing, etc. from RAM, and then write out the result to disk in one big I/O.
I talked with one of the head guys maintaining Gnat (GCC's Ada compiler) about this, and he told me that he actually used to put everything he could onto RAM disks so that even his file I/O was really just RAM reads and writes.
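A minimal sketch of that "one big read" approach in C++ (slurp the whole source file into memory up front, then lex and parse entirely from RAM):

    #include <fstream>
    #include <sstream>
    #include <stdexcept>
    #include <string>

    // Read an entire source file into memory in one streamed read, so that
    // lexing and parsing never have to touch the disk again.
    std::string slurp(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        if (!in) throw std::runtime_error("cannot open " + path);
        std::ostringstream buf;
        buf << in.rdbuf();        // single bulk read of the whole file
        return buf.str();
    }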
In most languages (pretty well everything other than C++), compiling individual compilation units is quite fast.
Binding/linking is often what's slow - the linker has to reference the whole program rather than just a single unit.
C++ suffers because - unless you use the pImpl idiom - compiling client code requires the implementation details of every object and all inline functions.
Java (source to bytecode) suffers because the grammar doesn't differentiate objects and classes - you have to load the Foo class to see if Foo.Bar.Baz is the Baz field of the object referenced by the Bar static field of the Foo class, or a static field of the Foo.Bar class. You can make that change in the source of the Foo class without changing the source of the client code, but you still have to recompile the client code, as the bytecode differentiates between the two forms even though the syntax doesn't. AFAIK Python bytecode doesn't differentiate between the two - modules are true members of their parents.
C++ and C suffer if you include more headers than are required, as the preprocessor has to process each header many times and the compiler has to compile them. Minimizing header size and complexity helps, suggesting that better modularity would improve compilation time. It's not always possible to cache header compilation, as the definitions present when the header is preprocessed can alter its semantics, and even its syntax.
C suffers if you use the preprocessor a lot, but the actual compilation is fast; much C code uses typedef struct _X* X_ptr to hide the implementation better than C++ does - a C header can easily consist of just typedefs and function declarations, giving better encapsulation.
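For illustration, here is the opaque-pointer style the previous paragraph refers to, with hypothetical names; clients see only a handle typedef and function declarations, so they never recompile against the struct layout:

    /* widget.h - clients see only an opaque handle, never the struct layout. */
    typedef struct Widget* WidgetHandle;

    WidgetHandle widget_create(int size);
    int          widget_size(WidgetHandle w);
    void         widget_destroy(WidgetHandle w);

    /* widget.cpp - the only translation unit that needs the definition (and
       the only one recompiled when it changes). */
    struct Widget { int size; };

    WidgetHandle widget_create(int size)        { return new Widget{size}; }
    int          widget_size(WidgetHandle w)    { return w->size; }
    void         widget_destroy(WidgetHandle w) { delete w; }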
So I'd suggest making your language hide implementation details from client code, and if you are designing an OO language with both instance members and namespaces, make the syntax for accessing the two unambiguous. Allow true modules, so client code only has to be aware of the interface rather than the implementation details. Don't allow preprocessor macros or other variation mechanisms to alter the semantics of referenced modules.
Here are some performance tricks that we've learned by measuring compilation speed and what affects it:
Write a two-pass compiler: characters to IR, IR to code. (It's easier to write a three-pass compiler that goes characters -> AST -> IR -> code, but it's not as fast.)
As a corollary, don't have an optimizer; it's hard to write a fast optimizer.
Consider generating bytecode instead of native machine code. The virtual machine for Lua is a good model.
Try a linear-scan register allocator or the simple register allocator that Fraser and Hanson used in lcc.
In a simple compiler, lexical analysis is often the greatest performance bottleneck. If you are writing C or C++ code, use re2c. If you're using another language (which you will find much more pleasant), read the paper about re2c and apply the lessons learned (a small hand-rolled sketch follows after this list).
Generate code using maximal munch, or possibly iburg.
Surprisingly, the GNU assembler is a bottleneck in many compilers. If you can generate binary directly, do so. Or check out the New Jersey Machine-Code Toolkit.
As noted above, design your language to avoid anything like #include. Either use no interface files or precompile your interface files. This tactic dramatically reduces the burden on the lexer, which as I said is often the biggest bottleneck.
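To make the lexing point concrete, here is a deliberately naive sketch of a tight, buffer-driven scanner - one pass over an in-memory buffer, no per-character I/O; a tool like re2c would generate a much faster DFA-based version of the same shape:

    #include <cctype>
    #include <string>
    #include <vector>

    enum class Tok { Ident, Number, Punct, End };

    struct Token { Tok kind; std::string text; };

    // Single pass over an in-memory buffer: no virtual calls, no stream
    // reads, just pointer arithmetic over characters already in RAM.
    std::vector<Token> lex(const std::string& src) {
        std::vector<Token> out;
        const char* p = src.c_str();
        const char* end = p + src.size();
        while (p < end) {
            if (std::isspace((unsigned char)*p)) { ++p; continue; }
            const char* start = p;
            if (std::isalpha((unsigned char)*p) || *p == '_') {
                while (p < end && (std::isalnum((unsigned char)*p) || *p == '_')) ++p;
                out.push_back({Tok::Ident, std::string(start, p)});
            } else if (std::isdigit((unsigned char)*p)) {
                while (p < end && std::isdigit((unsigned char)*p)) ++p;
                out.push_back({Tok::Number, std::string(start, p)});
            } else {
                ++p;
                out.push_back({Tok::Punct, std::string(start, p)});
            }
        }
        out.push_back({Tok::End, ""});
        return out;
    }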
Here's a shot..
Use incremental compilation if your toolchain supports it.
(make, Visual Studio, etc.).
For example, in GCC/make, if you have many files to compile, but only make changes in one file, then only that one file is compiled.
Eiffel had the idea of different frozen states, so recompiling didn't necessarily mean that the whole class had to be recompiled.
How much can you break up the compilable modules, and how much do you care to keep track of them?
Make the grammar simple and unambiguous, and therefore quick and easy to parse.
Place strong restrictions on file inclusion.
Allow compilation without full information whenever possible (eg. predeclaration in C and C++).
One-pass compilation, if possible.
One thing surprisingly missing in the answers so far: make sure you're using a context-free grammar, etc. Have a good hard look at languages designed by Wirth, such as Pascal and Modula-2. You don't have to reimplement Pascal, but the grammar design is custom-made for fast compiling. Then see if you can find any old articles about the tricks Anders pulled implementing Turbo Pascal. Hint: table-driven.
It depends on what language/platform you're programming for. For .NET development, minimise the number of projects that you have in your solution.
In the old days you could get dramatic speedups by setting up a RAM drive and compiling there. Don't know if this still holds true, though.
In C++ you could use distributed compilation with tools like Incredibuild
A simple one: make sure the compiler can natively take advantage of multi-core CPUs.
Make sure that everything can be compiled the first time you try to compile it, e.g. ban forward references.
Use a context free grammar so that you can find the correct parse tree without a symbol table.
Make sure that the semantics can be deduced from the syntax so you can construct the correct AST directly rather than by mucking with a parse tree and symbol table.
How serious a compiler is this?
Unless the syntax is pretty convoluted, the parser should be able to run no more than 10-100 times slower than just indexing through the input file characters.
Similarly, code generation should be limited by output formatting.
You shouldn't be hitting any performance issues unless you're doing a big, serious compiler, capable of handling mega-line apps with lots of header files.
Then you need to worry about precompiled headers, optimization passes, and linking.
I haven't seen much work done on minimizing compile time. But some ideas do come to mind:
Keep the grammar simple. A convoluted grammar will increase your compile time.
Try making use of parallelism, either on a multicore CPU or a GPU.
Benchmark a modern compiler, see where the bottlenecks are, and see what you can do in your compiler/language to avoid them.
Unless you are writing a highly specialized language, compile time is not really an issue.
Make a build system that doesn't suck!
There are a huge number of programs out there with maybe 3 source files that take under a second to compile, but before you get that far you have to sit through an automake script that spends about 2 minutes checking things like the size of an int. And if you go to compile something else a minute later, it makes you sit through almost exactly the same set of tests.
So unless your compiler is doing awful things to the user like changing the size of its ints or changing basic function implementations between runs, just dump that info out to a file and let them get it in a second instead of 2 minutes.
