likely and unlikely macros - windows

Are there any standard ways (using profilers) to check whether these GCC-recognized branch prediction macros actually save clock cycles through better instruction pipelining? How can we measure a program with and without these macros? Is measuring the elapsed time the only way to do it?
Are there similar branch prediction macros on Windows (the assume keyword, for example)?
-Kartlee

I’m not familiar with any profilers that will show branch efficiencies. The Linux time program should work well enough to help you benchmark.
On all modern x86 CPUs, conditional jump (Jcc) instructions are faster if they don’t branch and instead just fall through to the next instruction.
GCC’s __builtin_expect function provides a hint to the compiler: it tells which side of an if() should be the fall-through and which side should be the branch. You should only use this function if you are 100% sure about it. There is no equivalent function for VC++. I’m not sure about ICC.
A better way to do this is to avoid these non-standard functions and use Profile Guided Optimization (PGO), in which you run the program and it records which way each branch actually goes, so the compiler can lay out the code accordingly.
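As a rough illustration of the hint described above, here is a minimal sketch of how likely/unlikely macros are commonly built on __builtin_expect. The macro names LIKELY and UNLIKELY and the function sum_non_negative are hypothetical names chosen for this example. On compilers other than GCC and Clang the macros are assumed to expand to the plain condition, so the code still builds but carries no hint.

```cpp
#include <cstddef>

// Wrap __builtin_expect where it exists; otherwise fall back to the bare
// condition so the code still compiles (for example with VC++).
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

// Hypothetical example: sum the non-negative values in an array, under the
// assumption that negative values are rare in the real input.
long sum_non_negative(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (UNLIKELY(data[i] < 0)) {
            continue;  // rare path: skip the value
        }
        sum += data[i];  // common path, hinted as the fall-through
    }
    return sum;
}
```

The hint changes only code layout, never the result, so the way to judge it is to build the program with and without the macros and compare timings on representative input. C++20 later standardized the [[likely]] and [[unlikely]] attributes for the same purpose.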

Related

How can I find out the time required to perform a particular instruction on an Xtensa microprocessor, e.g. wsr / rsr?

I am trying to optimize code on an ESP32, which uses Xtensa LX6 microprocessors. I want to know the cost of the wsr and rsr instructions, which are used to read or write the special registers.
First of all, optimize only after you have profiled and come to the conclusion that this is your bottleneck.
In rare cases (like a function that accesses registers) it might be a good idea to optimize code generated by the compiler, but usually, that is not where the bottleneck is.
In general, when optimizing compiler-generated code:
1) write a very simple function (that produces the asm you think you can optimize)
2) create TESTS for this function (people tend to introduce weird bugs when writing their own asm)
3) run & measure
4) replace the generated asm with your optimizations
5) test you haven't screwed up the logic
6) run & measure again
Even if you managed to optimize, think about readability, portability, maintenance, etc. before choosing your optimized version.
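The "run & measure" steps above could be sketched roughly as follows. This is a minimal, portable illustration and not an Xtensa-specific method: the helper average_ns, the function mix and the variable sink are all hypothetical names, and std::chrono is assumed to be available on the target. On a microcontroller a hardware cycle counter would probably give finer resolution.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: average wall-clock nanoseconds per call of `fn`.
// `iterations` is assumed to be greater than zero.
template <typename Fn>
double average_ns(Fn fn, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}

// Hypothetical "very simple function" whose generated code is under study.
std::uint32_t mix(std::uint32_t x) {
    return (x ^ (x >> 7)) * 2654435761u;
}

// Writing results to a volatile keeps the optimizer from discarding the
// work being measured.
volatile std::uint32_t sink;
```

A call such as `average_ns([] { sink = mix(sink); }, 1000000)` gives a baseline. After swapping in the hand-optimized version, the same tests and the same measurement are run again and the two numbers compared.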

Coding Style for GCC ARM Optimization levels

I've been doing embedded firmware since 1977 but I have never enabled optimization on any of the compilers I've used.
I'm working with the GCC ARM compiler for a CM4 micro.
Code runs as expected with NO optimization.
I use a lot of structures and pointers in my code.
I use volatile when a variable can change from within an interrupt routine.
I recently needed to speed up execution of my code, so I used optimization level -Og (the first time I have ever enabled optimization), which still gives good debugging and increased performance where I wanted it.
My issue/concern is the code behaves really flaky!!!
It behaves OK, then I make a small change and it misbehaves. The behavior changes each time I run the compiler, almost like there is an issue with address alignment or instructions have been completely removed.
I can change some variables to volatile and that also changes the behavior, but I don't understand why making global variables volatile (ones not modified in an interrupt routine) would improve the behavior.
I'm about ready to give up on overall optimization and look at using function-specific optimization, since I know which functions affect the performance I'm trying to improve.
Can anyone explain how coding style can be impacted negatively with optimization?
Any good documents that address coding style with optimization in mind?
Does GCC function level optimization work well?
Thanks.
Joe

Parallel STL algorithms in OS X

I'm working on converting an existing program to take advantage of some parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outlines two specific steps.
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to be disappointed in the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang. It is not STL by any means, but it is cross-platform.
A third alternative is to use Grand Central Dispatch, the official multicore API in OS X; again, hardly STL.
A fourth alternative is to wait for C++17, which will have a Parallelism module.
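For the OpenMP alternative, a minimal sketch of a parallel sum (standing in for the std::accumulate loop from the question) might look like this. The function name parallel_sum is hypothetical, and the code assumes a compiler with OpenMP support, enabled with the -fopenmp flag on GCC and Clang.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums a vector using an OpenMP reduction.
// Built with -fopenmp, the loop iterations are divided among threads and the
// per-thread partial sums are combined at the end. Built without it, the
// pragma is ignored and the loop runs serially with the same result.
long long parallel_sum(const std::vector<int>& values) {
    long long total = 0;
    const long long n = static_cast<long long>(values.size());
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < n; ++i) {
        total += values[static_cast<std::size_t>(i)];
    }
    return total;
}
```

Because the serial and parallel builds return the same value, timing both builds on the real dataset while watching per-core load in a system monitor is one way to check whether the parallel version is actually in effect.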

Fastest math programming language?

I have an application that requires millions of subtractions and remainders. I originally programmed this algorithm in C#.NET, but it takes five minutes to process this information and I need it faster than that.
I have considered Perl and that seems to be the best alternative now. VB.NET was slower in testing. C++ may be better also. Any advice would be greatly appreciated.
You need a compiled language like Fortran, C, or C++. Other languages are designed to give you flexibility, object-orientation, or other advantages, and assume absolutely fastest performance is not your highest priority.
Know how to get maximum performance out of a single thread, and after you have done so investigate sharing the work across multiple cores, for example with MPI. To get maximum performance in a single thread, one thing I do is single-step it at the machine instruction level, to make sure it's not dawdling about in stuff that could be removed.
Some calculations are regular enough to benefit from GPGPUs: recent graphics cards are essentially specialized massively parallel numerical co-processors. For instance, you could code your numerical kernels in OpenCL. Otherwise, learn C++11 (not some earlier version of the C++ standard) or C. And in many cases OCaml could be nearly as fast as C++ but much easier to code with.
Perhaps your problem can be handled by Scilab or R; I did not understand it well enough to help more.
And you might take advantage of your multi-core processor by e.g. using Pthreads or MPI.
Finally, the Linux operating system is perhaps better suited to massive calculations. It is significant that most supercomputers use it today.
If execution speed is the highest priority, that usually means Fortran.
Try Julia: its killer feature is being easy to code in a high-level, concise way, while keeping performance within the same order of magnitude as Fortran/C.
PARI/GP is the best I have used so far. It's written in C.
Take a look at the DMelt mathematical program. The program calls Java libraries. The Java virtual machine can optimize long mathematical calculations for you.
The standard tool for mathematical numerical operations in engineering is often Matlab (or, as free alternatives, Octave or the already mentioned Scilab).

When should I consider the performance impact of a function call?

In a recent conversation with a fellow programmer, I asserted that "if you're writing the same code more than once, it's probably a good idea to refactor that functionality such that it can be called once from each of those places."
My fellow programmer buddy instead insisted that the performance impact of making these function calls was not acceptable.
Now, I'm not looking for validation of who was right. I'm simply curious to know if there are situations or patterns where I should consider the performance impact of a function call before refactoring.
"My fellow programmer buddy instead insisted that the performance impact of making these function calls was not acceptable."
...to which the proper answer is "Prove it."
The old saw about premature optimization applies here. Anyone who isn't familiar with it needs to be educated before they do any more harm.
IMHO, if you don't have the attitude that you'd rather spend a couple hours writing a routine that can be used for both than 10 seconds cutting and pasting code, you don't deserve to call yourself a coder.
Don't even consider the effect of calling overhead if the code isn't in a loop that's being called millions of times, in an area where the user is likely to notice the difference. Once you've met those conditions, go ahead and profile to see if your worries are justified.
Modern compilers of languages such as Java will inline certain function calls anyway. My opinion is that the design is way more important than the few instructions spent on a function call. The only situation I can think of would be writing some really fine-tuned code in assembler.
You need to ask yourself several questions:
Cost of time spent on optimizing code vs cost of throwing more hardware at it.
How does this impact maintainability?
How does going in either direction impact your deadline?
Does this really beg optimization when many modern compilers will do it for you anyway? Do not try to outsmart the compiler.
And of course, which will help you sleep better at night? :)
My bet is that there was a time in which the performance cost of a call to an external method or function WAS something to be concerned with, in the same way that the lengths of variable names and such all needed to be evaluated with respect to performance implications.
With the monumental increases in processor speed and memory resources in the last two decades, I propose that these concerns are no longer as pertinent as they once were.
We have been able to use long variable names without concern for some time, and the cost of a call to external code is probably negligible in most cases.
There might be exceptions. If you place a function call within a large loop, you may see some impact, depending upon the number of iterations.
I propose that in most cases you will find that refactoring code into discrete function calls will have a negligible impact. There might be occasions in which there IS an impact. However, proper TESTING of a refactoring will reveal this. In that minority of cases, your friend might be correct. For most of the rest of the time, I propose that your friend is clinging a little too closely to practices which pre-date most modern processors and storage media.
You care about function call overhead the same time you care about any other overhead: when your performance profiling tool indicates that it's a problem.
for the c/c++ family:
the 'cost' of the call is not important. if it needs to be fast, you just have to make sure the compiler is able to inline it. that means that:
the body must be visible to the compiler
the body is indeed small enough to be considered an inline candidate.
the method does not require dynamic dispatch
there are a few ways to break this default ability. for example:
huge instruction count already in the callsite. even with early inlining, the compiler may pop a trivial function out of line (even though it could generate more instructions/slower execution). early inlining is the compiler's ability to inline a function early on, when it sees the call costs more than the inline.
recursion
the inline keyword is more or less useless in this era, regarding its original intent. however, many compilers offer a means to restore the meaning, with a compiler specific directive. using this directive (correctly) helps considerably. learning how to use it correctly takes time. if in doubt, omit the directive and leave it up to the compiler.
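as a rough sketch of such a compiler specific directive: the macro name FORCE_INLINE and the functions square and sum_of_squares are hypothetical names for this example. __forceinline (msvc) and __attribute__((always_inline)) (gcc/clang) are the actual directives, and any other compiler is assumed to fall back to plain inline.

```cpp
// Map one macro onto each compiler's "really inline this" directive.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline  // fallback: only a suggestion to the compiler
#endif

// The body is visible at the call site, small, and needs no dynamic
// dispatch, so it meets the conditions listed above.
FORCE_INLINE int square(int x) {
    return x * x;
}

// Hypothetical hot loop that calls the small helper once per element.
int sum_of_squares(const int* data, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += square(data[i]);
    }
    return sum;
}
```

whether the directive helps is something to confirm by profiling or by reading the generated assembly. for a helper this small the compiler would very likely inline it without being told.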
assuming you are using a modern compiler, there is no excuse to avoid the function, unless you're also willing to go down to assembly for this particular program.
as it stands, and if performance is crucial, you really have two choices:
1) learn to write well organized programs for speed. downside: longer compile times
2) maintain a poorly written program
i prefer 1. any day.
(yes, i have spent a lot of time writing performance critical programs)

Resources