Measuring execution time in the D language

I'm new to the D language and need to measure the execution time of an algorithm. What are my options? Is there already some built-in solution? I could not find anything conclusive on the web.

One way is to use the -profile command-line parameter. After you run the program, it will create a file trace.log where you can find the run time for each function. This will, of course, slow your program down, as the compiler inserts time-counting code into each of your functions. This method is used to find the relative speed of functions, so you can identify which ones to optimize for the biggest improvement in app speed with minimum effort.
A second option is to use the std.datetime.StopWatch class. See the example in the link.
Or, even better suited, you might use the std.datetime.benchmark function directly.
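For illustration, a minimal sketch of both approaches; in recent D releases these live in the std.datetime.stopwatch module (older releases exported them from std.datetime), and myAlgorithm is a placeholder for the code being measured:

import std.datetime.stopwatch : AutoStart, StopWatch, benchmark;
import std.stdio : writeln;

void myAlgorithm()
{
    // Placeholder workload; substitute the algorithm to measure.
    int sum = 0;
    foreach (i; 0 .. 1_000_000)
        sum += i;
}

void main()
{
    // One timed run with StopWatch.
    auto sw = StopWatch(AutoStart.yes);
    myAlgorithm();
    sw.stop();
    writeln("one run: ", sw.peek());

    // benchmark calls the function n times and returns total durations.
    auto results = benchmark!myAlgorithm(100);
    writeln("100 runs: ", results[0]);
}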
Don't forget:
When benchmarking, use these dmd compiler flags to achieve maximum optimization: -release -O -inline -noboundscheck.
Never benchmark debug builds.
Make sure you don't call any library code inside the benchmarked functions; otherwise you would be benchmarking the performance of the library implementation instead of your own code.
Additionally, you may consider using the LDC or GDC compilers. Both of them produce better-optimized, faster-running code than DMD.

If your algorithm can be called from the command line, there's a nifty utility written in D that will run your program a number of times and print out the distribution of the time taken, the average, and all sorts of other useful numbers.
It's called avgtime and it's here: https://github.com/jmcabo/avgtime

std.benchmark, a proposed Phobos module, is in the review queue.

Related

`set -x` for golang: Print every executed line?

Is there something like the shell feature set -x for Go?
I would like to see every line of code that gets executed.
Lines of the standard library should not be printed.
You can combine two features, using pprof:
profiling, which can help you see who calls what, and for how long
(ofabry/go-callvis can also help you see a call graph)
its weblist view, which shows each executed line and its cost:
see "Interactive Profiling" in this guide.
This won't display each line executed in order, but it allows you to explore, after a run, what was executed.
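For illustration, a minimal sketch of producing a CPU profile that go tool pprof can then break down line by line with weblist (the file name cpu.prof and the function work are placeholders):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func work() {
	// Placeholder workload to show up in the profile.
	sum := 0
	for i := 0; i < 100000000; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	work()
	// Afterwards:
	//   go tool pprof cpu.prof
	//   (pprof) weblist work
}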
Note that Go 1.20/1.21 (Q4 2022/Q2 2023) will include (since #55022 is accepted):
Profile-Guided Optimization (PGO) for Go
Inefficiencies in Go programs can be isolated via profiling tools such as pprof and the Linux profiler perf. Such tools can pinpoint source code regions where most of the execution time is spent.
Unlike other optimizing compilers such as LLVM, the Go compiler does not yet perform Profile-Guided Optimization (PGO).
PGO uses information about the code's runtime behavior to guide compiler optimizations such as inlining, code layout, etc. PGO can improve application performance in the range of 15-30% [LLVM, AutoFDO].
In this proposal, we extend the Go compiler with PGO.
Specifically, we incorporate the profiles into the frontend of the compiler to build a call graph with node & edge weights (called WeightedCallGraph). The Inliner subsequently uses the WeightedCallGraph to perform profile-guided inlining which aggressively inlines hot functions.
We introduce a profile-guided code specialization pass that is tightly integrated with the Inliner and eliminates indirect method call overheads in hot code paths.
Furthermore, we annotate IR instructions with their associated profile weights and propagate these to the SSA-level in order to facilitate profile-guided basic-block layout optimization to benefit from better instruction-cache and TLB performance.
Finally, we extend Go's linker to also consume the profiles directly and perform function reordering optimization across package boundaries -- which also helps instruction-cache and TLB performance.
The format of the profile file consumed by our PGO is identical to the protobuf format produced by the pprof tool. This format is rich enough to carry additional hardware performance counter information such as cache misses, LBR, etc.
The existing perf_data_converter tool from Google can convert a perf.data file produced by Linux perf into a profile.proto file in protobuf format.
A new compilation flow is proposed for Go to support PGO.
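For reference, the workflow as it later shipped is roughly: collect a pprof CPU profile from a representative run, store it as default.pgo in the main package's directory, and rebuild (the -pgo build flag accepts off, auto, or a profile path):

go build -pgo=auto .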
Since Go is a compiled language, the executable doesn't contain any of the original source code, so such output is impossible. The closest you can come to the approach you want is to run your Go project in debug mode and step through every line of code.
This way you can decide, while running, whether to jump into a function or just execute it and step over it, because the debugger won't know what you consider "standard library", what should be traced line by line, and what not.
On the other hand, Go can be heavily multi-threaded with goroutines, so printing every executed line could become a mess within a minute (sometimes I have more than a hundred goroutines running at the same time).

ode45 mex file runs slow

This is the function trial2 which creates the system of differential equations:
function xdot=trial2(t,x)
delta=0.1045;epsilon=0.0048685;
xdot=[x(2);(-delta-epsilon*cos(t))*x(1)-0.7*delta*abs(x(1))];
Then, I try and solve this using ode45:
[t,x]=ode45('trial2',[0 10000000],[0;1]);
plot(t,x(:,1),'r');
But this takes approximately 15 minutes, so I tried to improve the run time by creating a mex file. I found an ode45 mex-equivalent function, ode45eq.c. Then I tried to solve it with:
mex ode45eq.c;
[t,x]=ode45eq('trial2',[0 10000000],[0;1]);
plot(t,x(:,1),'r');
But this seems to run even slower. What could be the reason, and how can I improve the run time to 2–3 minutes? I have to perform a lot of these computations. Also, is it worth creating a standalone C++ file to solve the problem faster?
Also, I am running this on a 64-bit system with an Intel i5 processor and 8 GB of RAM. How much of a speed gain do you think I would get by moving to a better processor with, say, 16 GB of RAM?
With current versions of Matlab, it's very unlikely that you'll see any performance improvement by compiling ode45 to C/C++ mex. In fact, the compiled version will almost certainly be slower as you found. There's a good reason that ode45 is written in pure Matlab, as opposed to being compiled down to a native C function: it has to call user functions written in Matlab on every iteration. Additionally, Matlab's ode45 is a very dynamic function that is capable of interacting with the Matlab environment in many ways during the course of integration (plotting output functions, event detection, interpolation, etc.). It's also probably more straightforward to safely handle dynamic memory allocation in Matlab, than in C.
Your C code calls your user function via mexCallMATLAB. This function is not really meant for repeated calls, especially if they transfer data back and forth. Doing what you're trying to do would likely require new mex APIs and possibly changes to the Matlab language.
If you want faster numerical integration, you're going to have to give up the convenience of being able to write your integration functions (i.e., trial2 in your example) in Matlab. You'll need to "hard code" your integration functions and compile them along with the integration scheme itself. With detailed knowledge of your problem and decent programming skills, you can write a tight integration loop, and it may be possible to achieve an order-of-magnitude speedup in some cases.
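For illustration only, a sketch of such a tight loop in Matlab itself: a fixed-step classical RK4 with the trial2 right-hand side written in-line. It has no adaptive error control, so it is not a drop-in replacement for ode45, and the real gains come from writing this same loop in compiled C:

function [t, x] = rk4_trial2(tspan, x0, h)
% Fixed-step classical RK4; h is the step size (no error control).
delta = 0.1045; epsilon = 0.0048685;
f = @(t, x) [x(2); (-delta - epsilon*cos(t))*x(1) - 0.7*delta*abs(x(1))];
t = (tspan(1):h:tspan(2)).';
n = numel(t);
x = zeros(n, 2);
x(1, :) = x0(:).';
for k = 1:n-1
    xk = x(k, :).';
    k1 = f(t(k),       xk);
    k2 = f(t(k) + h/2, xk + h/2*k1);
    k3 = f(t(k) + h/2, xk + h/2*k2);
    k4 = f(t(k) + h,   xk + h*k3);
    x(k+1, :) = (xk + h/6*(k1 + 2*k2 + 2*k3 + k4)).';
end
end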
Lastly, your trial2 function has an absolute value as well as an oscillating trigonometric function in it. Is this differential equation stiff? Have you tried other solvers, e.g., ode15s? Compare the outputs, even over a shorter time period. And you may find that you get a bit of a speed-up (~25% on my machine) if you use the modern way of passing a function handle instead of a string to ode45:
[t,x] = ode45(@trial2,[0 10000000],[0;1]);
The trial2 function can still be in a separate M-file, or it can be a sub-function in the same file as your call to ode45 (that file needs to be a function file, not a script, of course).
What you did is basically replace today's implementation with the implementation from 1993; you have shown that MathWorks did a great job improving the performance of the ode45 solver.
I don't see any possibility to improve the performance in this case. You can assume that such a fundamental part of MATLAB as ode45 is implemented in an optimal way; replacing it with some other code just to mex it is not the solution. All you could gain by using a mex function is cutting the overhead of the input/output handling, which is implemented in M code - probably less than 0.1 s of execution time.

What do we need to define while using parallel optimization flag?

I have a program with more than 100 subroutines, and I am trying to make it run faster by compiling these subroutines with the parallel flag. I was wondering what variables or parameters I need to define in the program if I want to use the parallel flag. Just using the parallel optimization flag actually increased my program's run time compared to the build without the parallel flag.
Any suggestions are highly appreciated. Thanks a lot.
I can give you some general guidelines, but without knowing your specific compiler and platform/OS I won't be able to help you specifically. As far as I know, all of the autoparallelization schemes used in Fortran compilers end up using either OpenMP or MPI commands to split loops out into either threads or processes. There is a certain amount of overhead associated with those schemes. For instance, in one case I had a program that used an optimization library which was provided by a vendor as a compiled library, without optimization within it. As all of my subroutines and functions were either outside or inside the large loop of the optimizer, and since there was only object data, the autoparallelizer wasn't able to perform interprocedural optimization and as such failed to use more than one core. In that case, due to the DLL loaded for OpenMP, the /Qparallel flag actually added ~10% to the run time.
As a note, autoparallelizers aren't magic. Essentially, they do the same kind of thing that autovectorization techniques do: they look for loops in which no data depends on the previous iteration. If the compiler detects that variables change between iterations, or if it can't tell, it will not attempt to parallelize the loop.
If you are using the Intel Fortran compiler, you can turn on a diagnostic switch, "/Qpar-report3" or "-par-report3", to get information on the dependency tree of loops and see why they failed to parallelize. If you don't have access to large sections of the code you are using, in particular the parts with major loops, there is a good chance there won't be much opportunity in your code for the auto-parallelizer.
In any case, you can always attempt to reduce dependencies and reformulate your code so that it is friendlier to autoparallelization.
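As an illustration of that dependence rule, a sketch (array names are placeholders): the first loop below has independent iterations and is a candidate for auto-parallelization, while the second carries a dependence on a(i-1) and will be rejected:

program autopar_demo
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n)
  integer :: i

  b = 1.0

  ! Independent iterations: an auto-parallelizer can split this loop.
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do

  ! Cross-iteration dependence on a(i-1): will not be parallelized.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  print *, a(n)
end program autopar_demo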

If or function pointers in Fortran

As is so common with Fortran, I'm writing a massively parallel scientific code. At the beginning of my code I read a configuration file which tells me which type of solver I want to use. That means that in a subroutine (during the main run) I have
if (solver .eq. 1) then
    call solver1()
elseif (solver .eq. 2) then
    call solver2()
else
    call solver3()
endif
Edit, to avoid some confusion: this if is inside my time-integration loop, and I have one that is inside 3 nested loops.
Now my question is: wouldn't it be more efficient to use function pointers instead, since the solver variable does not change during execution, except in the initialisation procedure?
Function pointers are obviously F2003. That shouldn't be a problem as long as I use gfortran 4.6. But I'm mainly running on a BlueGene/P; there is an F2003 compiler there, so I suppose it will work as well, although I couldn't find any conclusive evidence on the web.
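For reference, a minimal sketch of the F2003 procedure-pointer approach (module and routine names are illustrative): bind the pointer once during initialisation, then call it inside the loops:

module solver_mod
  implicit none
  abstract interface
     subroutine solver_iface()
     end subroutine solver_iface
  end interface
  procedure(solver_iface), pointer :: solve => null()
contains
  subroutine solver1()
  end subroutine solver1

  subroutine solver2()
  end subroutine solver2

  subroutine init_solver(choice)
    integer, intent(in) :: choice
    ! Bind once at initialisation; the hot loops then call solve().
    select case (choice)
    case (1)
       solve => solver1
    case default
       solve => solver2
    end select
  end subroutine init_solver
end module solver_mod

program main
  use solver_mod
  implicit none
  call init_solver(1)
  call solve()   ! indirect call replaces the if/elseif chain
end program main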
Knowing nothing about Fortran, this is my answer: The main problem with branching is that a CPU potentially cannot speculatively execute code across them. To mitigate this problem, branch prediction was introduced (which is very sophisticated in modern CPUs).
Indirect calls through a function pointer can be a problem for the prediction unit of the CPU. If it can't predict where the call will actually go, this will stall the pipeline.
I am quite sure the CPU will correctly predict your branch, because it always goes the same way - a trivial case of prediction.
Maybe the CPU can speculate across the indirect call, maybe it can't. This is why you need to test which is better.
If it cannot, you will certainly notice in your benchmark.
In addition, maybe you can hoist the if test out of your inner loop so it is evaluated less often. This would make the actual performance of the branch irrelevant.
If you only plan to use the function pointers once, at initialisation, and you are running codes on a BlueGene, isn't your concern for efficiency misdirected? Generally, any initialisation which works is OK; if it takes 1 sec instead of 1 msec, that's probably going to have zero impact on total execution time.
Code initialisation routines for clarity, ease of modification, that sort of thing.
EDIT
My guess is that using function pointers rather than your current code will have no impact on execution speed. But it's just an (educated, perhaps) guess, and I'll be very interested in any data you gather on this question.
If your solver routines take a non-trivial runtime, then the trivial runtime of the IF statements is likely to be immaterial. If the solver routines have a runtime comparable to the IF statement, then the total runtime is very short, so why do you care? This seems an optimization unlikely to pay off.
The first rule of runtime optimization is to profile your code to see which portions consume the runtime. Otherwise you are likely to optimize portions that are unimportant, which will accomplish nothing.
For what it's worth, someone else recently had a very similar concern: Fortran Subroutine Pointers for Mismatching Array Dimensions
After a brief search I couldn't find the answer to the question, so I ran a little benchmark myself (see this link for the Makefile & dependencies). The benchmark consists of:
Draw a random number to select method a, b, or c, all of which perform a simple addition to their single integer argument
Call the chosen method 100 million times, using either a procedure pointer or if-statements
Repeat the above 5 times
The result with gfortran 4.8.5 on an Intel Xeon E5-2630 v3 CPU @ 2.40 GHz is:
Time per call (proc. pointer): 1.89 ns
Time per call (if statement): 1.89 ns
In other words, there is not much of a performance difference!

Parallel STL algorithms in OS X

I'm working on converting an existing program to take advantage of some parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outlines two specific steps:
Add the compiler flag -D_GLIBCXX_PARALLEL
Possibly include the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything: the execution time is the same, and I don't see any indication of multiple cores being used when monitoring the system.
I get an error when adding the parallel/algorithm header; I thought it would be included with the latest version of GCC (4.7). (A minimal sketch of the build in question follows below.)
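For context, a minimal sketch of the kind of build this implies. GNU libstdc++'s parallel mode is built on OpenMP, so -fopenmp is needed as well, and Apple's default toolchain may not provide it:

// Build (with a real GCC): g++ -fopenmp -D_GLIBCXX_PARALLEL accumulate.cpp
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> v(10000000, 1.0);
    // With -D_GLIBCXX_PARALLEL, libstdc++ substitutes a parallel accumulate.
    double sum = std::accumulate(v.begin(), v.end(), 0.0);
    std::cout << sum << '\n';
}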
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to be disappointed in the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g., N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang; it is not STL by any means, but it is cross-platform.
A third alternative is to use Grand Central Dispatch - the official multicore API in OS X - again, hardly STL.
A fourth alternative is to wait for C++17, which will have a parallelism module.
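For reference, a sketch of what that module became: C++17 execution policies for standard algorithms. With GCC's libstdc++ this typically requires linking against TBB:

// Build (one possibility): g++ -std=c++17 -O2 reduce.cpp -ltbb
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> v(10000000, 1.0);
    // std::reduce may reorder and parallelize; the operation must be
    // associative and commutative for a correct result.
    double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
    std::cout << sum << '\n';
}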
