Primitives revisited - performance

I am well aware of the Stack Overflow question "What are the primitive Forth operators?", but it doesn't really address my question. I am looking not for the minimal but rather for a practical set of primitives.
Recently I faced a problem which required frequently sorting quite large arrays, and performance became critical. A naive qsort benchmarked at 20. Porting a heavily (algorithmically) optimized STL version gained me a benchmark of 16. Native C++ laughed at me from a benchmark of 3. Oh well.
Finally I bit the bullet and implemented EXCH ( a1 a2 -- a1 a2 ) and non-destructive compares ( n1 n2 -- n1 n2 flag ) as primitives. The results were amazing: a three-fold performance gain. Still not C++, but way closer.
Why doesn't standard Forth have them out of the box?
PS: the benchmark is (execution time, nsec)/(n log n)
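For readers who want to reproduce a figure of this kind, here is a minimal sketch of the metric in Python rather than Forth. The function name, the use of the built-in sort as a stand-in for the sort under test, and the choice of log base 2 are all assumptions; the post does not say which base was used.

```python
import math
import random
import time

def normalized_sort_cost(n, seed=0):
    """Return (elapsed nanoseconds) / (n * log2(n)) for sorting n random ints."""
    rng = random.Random(seed)            # fixed seed: same input on every run
    data = [rng.randrange(1 << 30) for _ in range(n)]
    start = time.perf_counter_ns()
    data.sort()                          # stand-in for the sort being measured
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / (n * math.log2(n))

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, round(normalized_sort_cost(n), 3))
```

If the normalized value stays roughly flat as n grows, the implementation is scaling as n log n, and the value itself becomes a per-comparison cost that can be compared across implementations.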

The effect of such changes depends heavily on the quality of your Forth system. Obviously, the worse the compiler is, the more effect well-thought-out changes will have. On the other hand, it is more difficult to shave off 1 cycle of 4 than 10 cycles of 40. This means that at some point high-level rewrites do not pay off anymore (unless you are a compiler writer :-)
There are of course tricks with multi-threading and special CPU instructions that one might experiment with.
To see where you are, it would be helpful if you could provide actual code and timings on a real system.

I suspect that EXCH is not a part of standard Forth simply because it is obscure enough that you are probably better off writing your own if you need it.
I would imagine that non-destructive compares would count as a violation of the general principles of Forth, specifically that words should consume their arguments. If you want to keep the arguments you have to explicitly create a copy.
I don't know enough about implementations to say what sort of performance impact it has, but for most applications
: non-destructive-> 2dup > ;
would make sense and work well enough.
I realise that this is a slightly evasive answer, but I suspect that it is that way because from what I have read the choices behind what words should constitute a standard Forth were not made to optimise execution speed.

Related

How to accurately measure performance of sorting algorithms

I have a bunch of sorting algorithms in C I wish to benchmark. I am concerned regarding good methodology for doing so. Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method. How do I minimize the effect of said variables on the benchmark's results?
To give you a few examples, I've considered multiple implementations on two different languages to adjust for the first two variables. Moreover I could compile the code with different compilers on fairly mundane (and specified) arguments. Now I'm going to be running the test on my machine, which features turbo boost and whatnot and often boosts a core running stuff to the moon. Of course I will be disabling that and doing multiple runs and likely taking their mean completion time to adjust for that as well. Regarding the input data, I will be taking different array sizes, from very small to relatively large. I do not know what the increments should ideally be like, and what the range of the elements should be as well. Also I presume duplicate elements should be allowed.
I know that theoretical analysis of algorithms accounts for all of these methods, but it is crucial that I complement my study with actual benchmarks. How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected? I'm comfortable with the technologies I'm working with, less so with strict methodology for studying a topic. Thank you.

You can't benchmark abstract algorithms, only specific implementations of them, compiled with specific compilers running on specific machines.
Choose a couple different relevant compilers and machines (e.g. a Haswell, Ice Lake, and/or Zen2, and an Apple M1 if you can get your hands on one, and/or an AArch64 cloud server) and measure your real implementations. If you care about in-order CPUs like ARM Cortex-A53, measure on one of those, too. (Simulation with GEM5 or similar performance simulators might be worth trying. Also maybe relevant are low-power implementations like Intel Silvermont whose out-of-order window is much smaller, but also have a shorter pipeline so smaller branch mispredict penalty.)
If some algorithm allows a useful micro-optimization in the source, or that a compiler finds, that's a real advantage of that algorithm.
Compile with options you'd use in practice for the use-cases you care about, like clang -O3 -march=native, or just -O2.
Benchmarking on cloud servers makes it hard / impossible to get an idle system, unless you pay a lot for a huge instance, but modern AArch64 servers are relevant and may have different ratios of memory bandwidth vs. branch mispredict costs vs. cache sizes and bandwidths.
(You might well find that the same code is the fastest sorting implementation on all or most of the systems you test on.)
Re: sizes: yes, a variety of sizes would be good.
You'll normally want to test with random data, perhaps always generated from the same PRNG seed so you're sorting the same data every time.
You may also want to test some unusual cases like already-sorted or almost-sorted, because algorithms that are extra fast for those cases are useful.
If you care about sorting things other than integers, you might want to test with structs of different sizes, with an int key as a member. Or a comparison function that does some amount of work, if you want to explore how sorts do with a compare function that isn't as simple as just one compare machine instruction.
As always with microbenchmarking, there are many pitfalls around warm-up of arrays (page faults) and CPU frequency, and more. See also: "Idiomatic way of performance evaluation?"
Re: "taking their mean completion time":
You might want to discard high outliers, or take the median which will have that effect for you. Usually that means "something happened" during that run to disturb it. If you're running the same code on the same data, often you can expect the same performance. (Randomization of code / stack addresses with page granularity usually doesn't affect branches aliasing each other in predictors or not, or data-cache conflict misses, but tiny changes in one part of the code can change performance of other code via effects like that if you're re-compiling.)
If you're trying to see how it would run when it has the machine to itself, you don't want to consider runs where something else interfered. If you're trying to benchmark under "real world" cloud server conditions, or with other threads doing other work in a real program, that's different and you'd need to come up with realistic other loads that use some amount of shared resources like L3 footprint and memory bandwidth.
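To make the advice above concrete, here is a minimal harness sketch in Python (the question is about C, so treat this as an outline of the method, not of the timing numbers). The function names, the 1% perturbation used for "almost sorted", and the repeat count are assumptions chosen for illustration.

```python
import random
import statistics
import time

def time_sort(sort_fn, data, repeats=7):
    """Time sort_fn on fresh copies of data; return the per-run times in seconds."""
    times = []
    sort_fn(list(data))                  # warm-up run, not recorded
    for _ in range(repeats):
        work = list(data)                # copy so every run sorts the same input
        start = time.perf_counter()
        sort_fn(work)
        times.append(time.perf_counter() - start)
    return times

def make_inputs(n, seed=12345):
    """Build a few input shapes from one fixed PRNG seed."""
    rng = random.Random(seed)
    random_data = [rng.randrange(n) for _ in range(n)]   # duplicates allowed
    sorted_data = sorted(random_data)
    almost_sorted = list(sorted_data)
    for _ in range(max(1, n // 100)):    # swap roughly 1% of positions
        i, j = rng.randrange(n), rng.randrange(n)
        almost_sorted[i], almost_sorted[j] = almost_sorted[j], almost_sorted[i]
    return {"random": random_data, "sorted": sorted_data, "almost sorted": almost_sorted}

if __name__ == "__main__":
    for name, data in make_inputs(50_000).items():
        runs = time_sort(list.sort, data)
        print(f"{name:14s} median {statistics.median(runs):.6f}s  min {min(runs):.6f}s")
```

Reporting the median (or the minimum) instead of the mean is what discards the runs where "something happened".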

"Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method."
Let's look at this from a very different perspective - how to present information to humans.
With 2 variables you get a nice 2-dimensional grid of results, maybe like this:
         A = 1        A = 2
B = 1    4 seconds    2 seconds
B = 2    6 seconds    3 seconds
This is easy to display and easy for humans to understand and draw conclusions from (e.g. from my silly example table it's trivial to make 2 very different observations: "A=2 is twice as fast as A=1 (regardless of B)" and "B=1 is faster than B=2 (regardless of A)").
With 3 variables you get a 3-dimensional grid of results, and with N variables you get an N-dimensional grid of results. Humans struggle with "3-dimensional data on a 2-dimensional screen", and more dimensions become a disaster. You can mitigate this a little by "peeling off" a dimension (e.g. instead of trying to present a 3D grid of results you could show multiple 2D grids); but that doesn't help humans much.
Your primary goal is to reduce the number of variables.
To reduce the number of variables:
a) Determine how important each variable is for what you intend to observe (e.g. "which algorithm" will be extremely important and "which language" will be less important).
b) Merge variables based on importance and "logical grouping". For example, you might get three "lower importance" variables (language, compiler, compiler options) and merge them into a single "language+compiler+options" variable.
Note that it's very easy to overlook a variable. For example, you might benchmark "algorithm 1" on one computer and benchmark "algorithm 2" on an almost identical computer, but overlook the fact that (even though both benchmarks used identical languages, compilers, compiler options and CPUs) one computer has faster RAM chips, and overlook "RAM speed" as a possible variable.
Your secondary goal is to reduce the number of values each variable can have.
You don't want massive table/s with 12345678 million rows; and you don't want to spend the rest of your life benchmarking to generate such a large table.
To reduce the number of values each variable can have:
a) Figure out which values matter most
b) Select the right number of values in order of importance (and ignore/skip all other values)
For example, if you merged three "lower importance" variables (language, compiler, compiler options) into a single variable; then you might decide that 2 possibilities ("C compiled by GCC with -O3" and "C++ compiled by MSVC with -Ox") are important enough to worry about (for what you're intending to observe) and all of the other possibilities get ignored.
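A small sketch of why both goals matter: the number of benchmark runs is the product of the number of values per variable. All variable names and values below are hypothetical, picked only to mirror the examples in this answer.

```python
from itertools import product

# Hypothetical variables and values, chosen only for illustration.
full = {
    "algorithm": ["quicksort", "mergesort", "heapsort"],
    "language": ["C", "C++"],
    "compiler": ["GCC", "Clang", "MSVC"],
    "options": ["-O2", "-O3"],
    "input": ["random", "sorted", "almost sorted"],
}

# After merging language/compiler/options and keeping two combinations.
reduced = {
    "algorithm": full["algorithm"],
    "toolchain": ["C, GCC, -O3", "C++, MSVC, -Ox"],
    "input": full["input"],
}

def configurations(variables):
    """Return every combination of values, one dict per benchmark run."""
    names = list(variables)
    return [dict(zip(names, combo)) for combo in product(*variables.values())]

print(len(configurations(full)), "runs before reduction")
print(len(configurations(reduced)), "runs after reduction")
```

Merging three variables into one and keeping two of its values shrinks both the table you have to present and the time you spend benchmarking.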
"How do I minimize the effect of said variables on the benchmark's results?"
"How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected?"
By identifying the variables (as part of the primary goal) and explicitly deciding which values the variables may have (as part of the secondary goal).
You've already been doing this. What I've described is a formal method of doing what people would unconsciously/instinctively do anyway. For one example, you have identified that "turbo boost" is a variable, and you've decided that "turbo boost disabled" is the only value for that variable you care about (but do note that this may have consequences - e.g. consider "single-threaded merge sort without the turbo boost it'd likely get in practice" vs. "parallel merge sort that isn't as influenced by turning turbo boost off").
My hope is that by describing the formal method you gain confidence in the unconscious/instinctive decisions you're already making, and realize that you were very much on the right path before you asked the question.

swap two variables. which way is faster?

Let's say we have two integers a and b. which way is faster for swapping their values?
c=a;
a=b;
b=c;
or
a=a+b;
b=a-b;
a=a-b;
or bitwise xor
a=a^b;
b=a^b;
a=a^b;
I'll test the performance differences when I'm able to, but I'd like to know now. Is the bitwise version the fastest?

Firstly, you cannot quantify the speed of an algorithm independent of the programming language, the compiler and the platform on which it is run. An algorithm is a mathematical abstraction.
Having said that:
for a typical programming language,
and a typical compiler, and
a typical execution platform,
the first version will typically be faster because it will typically compile to fewer native instructions that take fewer clock cycles to execute. The first version only requires load and store operations. The other two versions have (at least) the same number of loads and stores, and some additional arithmetic or bit manipulation instructions.
However, even that is not cut-and-dried.
The 2nd and 3rd examples are performing the swap without using a temporary variable. This is something you might do if using an extra temporary variable was expensive. This might happen on a machine which didn't provide enough general purpose registers, and the relative cost of loading / saving to memory was large. In some circumstances, the native code equivalents could be optimal.
However ... and this is the real point ... the best strategy is to leave this kind of decision to the compiler. Unless you are prepared to put a huge amount of effort into micro-optimizing, the compiler is likely to be able to do a better job than you can. Indeed, writing code in "cunning ways" is liable to make it harder for the compiler to optimize. (In the 3rd case for example, the compiler would need to figure out that the sequence is actually swapping 2 variables before it can substitute the optimal instruction sequence. Chances are that the optimizer won't be able to do that.)
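Speed aside, the "cunning" versions are also easier to get wrong. The sketch below is in Python, operating on list slots rather than on two plain variables; the function names are made up for illustration. It shows that the three variants agree in the ordinary case, but that the XOR and add/subtract versions destroy the value when both operands are the same storage location. (In C the add/subtract version can additionally overflow a signed int, which Python's unbounded integers hide.)

```python
def swap_temp(values, i, j):
    """Swap using a temporary, like the first version in the question."""
    tmp = values[i]
    values[i] = values[j]
    values[j] = tmp

def swap_arith(values, i, j):
    """Swap using addition and subtraction, like the second version."""
    values[i] = values[i] + values[j]
    values[j] = values[i] - values[j]
    values[i] = values[i] - values[j]

def swap_xor(values, i, j):
    """Swap using bitwise xor, like the third version."""
    values[i] = values[i] ^ values[j]
    values[j] = values[i] ^ values[j]
    values[i] = values[i] ^ values[j]

for swap in (swap_temp, swap_arith, swap_xor):
    pair = [3, 5]
    swap(pair, 0, 1)           # distinct slots: all three behave the same
    single = [7]
    swap(single, 0, 0)         # same slot twice: only the temporary is safe
    print(swap.__name__, pair, single)
```

A generic swap routine called with i == j is a realistic situation inside sorting code, which is one more reason to prefer the plain temporary.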

Tips and tricks on improving Fortran code performance [closed]

As part of my Ph.D. research, I am working on development of numerical models of atmosphere and ocean circulation. These involve numerically solving systems of PDE's on the order of ~10^6 grid points, over ~10^4 time steps. Thus, a typical model simulation takes hours to a few days to complete when run in MPI on dozens of CPUs. Naturally, improving model efficiency as much as possible is important, while making sure the results are byte-to-byte identical.
While I feel quite comfortable with my Fortran programming, and am aware of quite some tricks to make code more efficient, I feel like there is still space to improve, and tricks that I am not aware of.
Currently, I make sure I use as few divisions as possible, and try not to use literal constants (I was taught to do this from very early on, e.g. use half=0.5 instead of 0.5 in actual computations), use as few transcendental functions as possible etc.
What other performance sensitive factors are there? At the moment, I am wondering about a few:
1) Does the order of mathematical operations matter? For example if I have:
a=1E-7 ; b=2E4 ; c=3E13
d=a*b*c
would d evaluate with different efficiency based on the order of multiplication? Nowadays, this must be compiler specific, but is there a straight answer? I notice d getting a (slightly) different value based on the order (precision limit), but will this impact the efficiency or not?
2) Passing lots (e.g. dozens) of arrays as arguments to a subroutine versus accessing these arrays from a module within the subroutine?
3) Fortran 95 constructs (FORALL and WHERE) versus DO and IF? I know that these mattered back in the 90's when code vectorization was a big thing, but is there any difference now with modern compilers being able to vectorize explicit DO loops? (I am using PGI, Intel, and IBM compilers in my work)
4) Raising a number to an integer power versus multiplication? E.g.:
b=a**4
or
b=a*a*a*a
I have been taught to always use the latter where possible. Does this affect efficiency and/or precision? (probably compiler dependent as well)
Please discuss and/or add any tricks and tips that you know about improving Fortran code efficiency. What else is out there? If you know anything specific to what each of the compilers above do related to this question, please include that as well.
Added: Note that I do not have any bottlenecks or performance issues per se. I am asking if there are any general rules for optimizing the code in sense of operations.
Thanks!

Sorry, but all the tricks you mentioned are simply ... ridiculous. More precisely, they make no difference in practice. For instance:
What could be the advantage of using half (=0.5) instead of 0.5?
The same goes for computing a**4 versus a*a*a*a. (a*a)**2 would be another possibility too. My personal preference is a**4, because a good compiler will automatically choose the best way.
For **, the only point that could matter is the difference between a**4 and a**4. (a real exponent), the latter being much more CPU-time consuming. But even this point is meaningless without a measurement in an actual simulation.
In fact, your approach is wrong. Develop your code as well as possible. After that, objectively measure the cost of the different parts of your code. Optimizing without measuring first is simply nonsense.
If a part accounts for a high percentage of the CPU time, 50% for instance, don't forget that optimizing that part alone cannot reduce the cost of the overall code by a factor greater than two. Anyway, start the optimization work with the most expensive part (the bottleneck).
Don't forget also that the main improvements are generally coming from better algorithms.
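The "factor of two" limit mentioned above is Amdahl's law. A minimal sketch in Python (not Fortran, and the function name is just an illustrative choice):

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    is made `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# A part taking 50% of the time, made infinitely fast: the limit is 2x.
print(overall_speedup(0.50, float("inf")))
# The same part made twice as fast.
print(overall_speedup(0.50, 2.0))
# A part taking 1% of the time, made twice as fast: barely measurable.
print(overall_speedup(0.01, 2.0))
```

This is why profiling comes first: the fraction of time a routine takes bounds what any amount of tuning of that routine can buy.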

I second the advice that these tricks that you have been taught are silly in this era. Compilers do this for you now; such micro-optimizations are unlikely to make a significant difference and may not be portable. Write clear and understandable code. Carefully select your algorithm. One thing that can make a difference is using the indices of multi-dimensional arrays in the correct order ... recasting an M x N array to N x M can help, depending on the pattern of data access by your program. After this, if your program is too slow, measure where the CPU time is consumed and improve only those parts. Experience shows that guessing is frequently wrong and leads to writing more opaque code for no reason. If you take a code section in which your program spends 1% of its time and make it twice as fast, it won't make any difference.
Here are previous answers on FORALL and WHERE: How can I ensure that my Fortran FORALL construct is being parallelized? and Do Fortran 95 constructs such as WHERE, FORALL and SPREAD generally result in faster parallel code?

You've got a-priori ideas about what to do, and some of them might actually help, but the biggest payoff is in a-posteriori analysis.
(Added: In other words, getting a*b*c into a different order might save a couple cycles (which I doubt), while at the same time you don't know you're not getting blind-sided by something spending 1000 cycles for no good reason.)
No matter how carefully you code it, there will be opportunities for speedup that you didn't foresee. Here's how I find them. (Some people consider this method controversial).
It's best to start with optimization flags OFF when you do this, so the code isn't all scrambled.
Later you can turn them on and let the compiler do its thing.
Get it running under a debugger with enough of a workload so it runs for a reasonable length of time.
While it's running, manually interrupt it, and take a good hard look at what it's doing and why.
Do this several times, like 10, so you don't draw erroneous conclusions about what it's spending time at.
Here are examples of things you might find:
It could be spending a large fraction of time calling math library functions unnecessarily due to the way some expressions were coded, or with the same argument values as in prior calls.
It could be spending a large fraction of time doing some file I/O, or opening/closing a file, deep inside some routine that seemed harmless to call.
It could be in a general-purpose library function, calling a subordinate subroutine, for the purpose of checking argument flags to the upper function. In such a case, much of that time might be eliminated by writing a special-purpose function and calling that instead.
If you do this entire operation two or three times, you will have removed the stupid stuff that finds its way into any software when it's first written.
After that, you can turn on the optimization, parallelism, or whatever, and be confident no time is being spent on silly stuff.
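The manual "interrupt it and look at the stack" procedure can be imitated automatically. The following is a rough sketch in Python, not a debugger session and not Fortran: a helper thread periodically peeks at the call stack of the running workload and counts which functions it sees. The workload functions and the sampling interval are invented for the example, and sys._current_frames() is a CPython-specific call, so treat this as an illustration of the idea only.

```python
import collections
import sys
import threading
import time

def slow_part():
    total = 0
    for i in range(3_000_000):       # deliberately expensive loop
        total += i * i
    return total

def fast_part():
    return sum(range(1000))

def workload():
    fast_part()
    slow_part()

def sample_stacks(target, interval=0.005):
    """Run target() in a thread and count the functions seen on its stack."""
    counts = collections.Counter()
    worker = threading.Thread(target=target)
    worker.start()
    while worker.is_alive():
        frame = sys._current_frames().get(worker.ident)   # CPython-specific
        names = set()
        while frame is not None:      # walk from the innermost frame outward
            names.add(frame.f_code.co_name)
            frame = frame.f_back
        counts.update(names)          # each function counted once per sample
        time.sleep(interval)
    worker.join()
    return counts

counts = sample_stacks(workload)
for name, hits in counts.most_common(5):
    print(f"{name:20s} seen in {hits} samples")
```

A function that shows up in most samples is where the time goes, whether because of its own work or because of something it calls, which is the same conclusion you would draw from pausing in a debugger a handful of times.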

Reasoning about performance in Haskell

The following two Haskell programs for computing the n'th term of the Fibonacci sequence have greatly different performance characteristics:
fib1 n =
  case n of
    0 -> 1
    1 -> 1
    x -> (fib1 (x-1)) + (fib1 (x-2))
fib2 n = fibArr !! n where
  fibArr = 1:1:[a + b | (a, b) <- zip fibArr (tail fibArr)]
They are very close to mathematically identical, but fib2 uses the list notation to memoize its intermediate results, while fib1 has explicit recursion. Despite the potential for the intermediate results to be cached in fib1, the execution time gets to be a problem even for fib1 25, suggesting that the recursive steps are always evaluated. Does referential transparency contribute anything to Haskell's performance? How can I know ahead of time if it will or won't?
This is just an example of the sort of thing I'm worried about. I'd like to hear any thoughts about overcoming the difficulty inherent in reasoning about the performance of a lazily-executed, functional programming language.
Summary: I'm accepting 3lectrologos's answer, because the point that you don't reason so much about the language's performance as about your compiler's optimization seems to be extremely important in Haskell, more so than in any other language I'm familiar with. I'm inclined to say that the importance of the compiler is the factor that differentiates reasoning about performance in lazy, functional languages from reasoning about the performance of any other type.
Addendum: Anyone happening on this question may want to look at the slides from Johan Tibell's talk about high performance Haskell.

In your particular Fibonacci example, it's not very hard to see why the second one should run faster (although you haven't specified what f2 is).
It's mainly an algorithmic issue:
fib1 implements the purely recursive algorithm and (as far as I know) Haskell has no mechanism for "implicit memoization".
fib2 uses explicit memoization (using the fibArr list to store previously computed values).
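The size of that algorithmic gap is easy to see by counting calls. The sketch below is in Python, not Haskell, and uses functools.lru_cache as a stand-in for the list-based memoization in fib2; the function names are made up, and the sequence starts 1, 1 as in the question.

```python
from functools import lru_cache

calls = 0

def fib_naive(n):
    """Plain recursion, like fib1: every call recomputes its subproblems."""
    global calls
    calls += 1
    if n < 2:
        return 1
    return fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    """Same recurrence, but each distinct n is evaluated only once."""
    if n < 2:
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_naive(20), "computed with", calls, "calls")
print(fib_memo(20), "computed with", fib_memo.cache_info().misses, "evaluations")
```

The naive version makes a number of calls proportional to the result itself, so its cost grows exponentially with n, while the memoized version evaluates each term once.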
In general, it's much harder to make performance assumptions for a lazy language like Haskell, than for an eager one. Nevertheless, if you understand the underlying mechanisms (especially for laziness) and gather some experience, you will be able to make some "predictions" about performance.
Referential transparency increases (potentially) performance in (at least) two ways:
First, you (as a programmer) can be sure that two calls to the same function with the same arguments will always return the same result, so you can exploit this in various cases to gain performance.
Second (and more important), the Haskell compiler can be sure of the above fact, and this may enable many optimizations that can't be enabled in impure languages (if you've ever written a compiler or have any experience in compiler optimizations you are probably aware of the importance of this).
If you want to read more about the reasoning behind the design choices (laziness, pureness) of Haskell, I'd suggest reading this.

Reasoning about performance is hard in Haskell and in lazy languages in general, although not impossible. Some techniques are covered in Chris Okasaki's Purely Functional Data Structures (also available online in a previous version).
Another way to ensure performance is to fix the evaluation order, either using annotations or continuation passing style. That way you get to control when things are evaluated.
In your example you might calculate the numbers "bottom up" and pass the previous two numbers along to each iteration:
fib n = fib_iter(1,1,n)
  where
    fib_iter(a,b,0) = a
    fib_iter(a,b,1) = a
    fib_iter(a,b,n) = fib_iter(a+b,a,n-1)
This results in a linear time algorithm.
Whenever you have a dynamic programming algorithm where each result relies on the N previous results, you can use this technique. Otherwise you might have to use an array or something completely different.

Your implementation of fib2 uses memoization, but each time you call fib2 it rebuilds the "whole" result. Turn on ghci time and size profiling:
Prelude> :set +s
If it was doing memoisation "between" calls the subsequent calls would be faster and use no memory. Call fib2 20000 twice and see for yourself.
By comparison a more idiomatic version where you define the exact mathematical identity:
-- the infinite list of all fibs numbers.
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
memoFib n = fibs !! n
actually does use memoisation, explicitly as you can see. If you run memoFib 20000 twice you'll see the time and space taken the first time; the second call is then instantaneous and takes no memory. No magic and no implicit memoization like a comment might have hinted at.
Now about your original question: optimizing and reasoning about performance in Haskell...
I wouldn't call myself an expert in Haskell; I have only been using it for 3 years, 2 of them at my workplace, but I did have to optimize and get to understand how to reason somewhat about its performance.
As mentioned in other posts, laziness is your friend and can help you gain performance; however, YOU have to be in control of what is lazily evaluated and what is strictly evaluated.
Check this comparison of foldl vs foldr
foldl actually stores "how" to compute the value, i.e. it is lazy. In some cases you save time and space by being lazy, as with the "infinite" fibs: it doesn't generate all of them but knows how to. When you know you will need the value you might as well just get it "strictly" speaking... That's where strictness annotations are useful, to give you back control.
I recall reading many times that in Lisp you have to "minimize" consing.
Understanding what is strictly evaluated and how to force it is important, but so is understanding how much "trashing" you do to the memory. Remember that Haskell values are immutable, which means that updating a "variable" actually creates a copy with the modification. Prepending with (:) is vastly more efficient than appending with (++) because (:) does not copy memory, whereas (++) does. Whenever a big atomic block is updated (even for a single char) the whole block needs to be copied to represent the "updated" version. The way you structure data and update it can have a big impact on performance. The ghc profiler is your friend and will help you spot these. Sure, the garbage collector is fast, but not having it do anything is faster!
Cheers

Aside from the memoization issue, fib1 also uses non-tailcall recursion. Tailcall recursion can be re-factored automatically into a simple goto and perform very well, but the recursion in fib1 cannot be optimized in this way, because you need the stack frame from each instance of fib1 in order to calculate the result. If you rewrote fib1 to pass a running total as an argument, thus allowing a tail call instead of needing to keep the stack frame for the final addition, the performance would improve immensely. But not as much as the memoized example, of course :)

Since allocation is a major cost in any functional language, an important part of understanding performance is to understand when objects are allocated, how long they live, when they die, and when they are reclaimed. To get this information you need a heap profiler. It's an essential tool, and luckily GHC ships with a good one.
For more information, read Colin Runciman's papers.

Optimization! - What is it? How is it done?

It's common to hear about "highly optimized code" or some developer needing to optimize theirs and whatnot. However, as a self-taught, new programmer I've never really understood what exactly people mean when talking about such things.
Care to explain the general idea of it? Also, recommend some reading materials and really whatever you feel like saying on the matter. Feel free to rant and preach.

Optimize is a term we use lazily to mean "make something better in a certain way". We rarely "optimize" something - more, we just improve it until it meets our expectations.
Optimizations are changes we make in the hopes to optimize some part of the program. A fully optimized program usually means that the developer threw readability out the window and has recoded the algorithm in non-obvious ways to minimize "wall time". (It's not a requirement that "optimized code" be hard to read, it's just a trend.)
One can optimize for:
Memory consumption - Make a program or algorithm's runtime size smaller.
CPU consumption - Make the algorithm computationally less intensive.
Wall time - Do whatever it takes to make something faster
Readability - Instead of making your app better for the computer, you can make it easier for humans to read it.
Some common (and overly generalized) techniques to optimize code include:
Change the algorithm to improve performance characteristics. If you have an algorithm that takes O(n^2) time or space, try to replace that algorithm with one that takes O(n * log n).
To relieve memory consumption, go through the code and look for wasted memory. For example, if you have a string intensive app you can switch to using Substring References (where a reference contains a pointer to the string, plus indices to define its bounds) instead of allocating and copying memory from the original string.
To relieve CPU consumption, cache as many intermediate results as you can. For example, if you need to calculate the standard deviation of a set of data, save that single numerical result instead of looping through the set each time you need to know the std dev.
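As a small sketch of the caching idea (in Python; the class and attribute names are invented for illustration, and the population standard deviation is an arbitrary choice), the result is stored after the first computation and thrown away only when the data changes:

```python
import statistics

class Samples:
    """Holds numbers and caches their standard deviation until the data changes."""

    def __init__(self, values):
        self._values = list(values)
        self._stdev = None          # None means "not computed yet"
        self.computations = 0       # bookkeeping for the demonstration only

    def add(self, value):
        self._values.append(value)
        self._stdev = None          # invalidate the cached result

    def stdev(self):
        if self._stdev is None:     # loop over the data only when necessary
            self._stdev = statistics.pstdev(self._values)
            self.computations += 1
        return self._stdev

samples = Samples([2, 4, 4, 4, 5, 5, 7, 9])
for _ in range(1000):
    samples.stdev()                 # 1000 reads, one actual computation
print(samples.stdev(), samples.computations)
```

The trade-off is the usual one for caches: a little extra memory and the obligation to invalidate correctly, in exchange for not repeating the work.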

I'll mostly rant with no practical advice.
Measure first. Optimization should be done in places where it matters. Highly optimized code is often difficult to maintain and a source of problems. In places where the code does not slow down execution anyway, I always prefer maintainability to optimization. Familiarize yourself with profiling, both intrusive (instrumented) and non-intrusive (low-overhead statistical). Learn to read a profiled stack, understand where the inclusive and exclusive time is spent, why certain patterns show up and how to identify the trouble spots.
You can't fix what you cannot measure. Have your program report, through some performance infrastructure, the things it does and the time they take. I come from a Win32 background so I'm used to the Performance Counters and I'm extremely generous at sprinkling them all over my code. I even automated the code to generate them.
And finally some words about optimizations. Most discussions about optimization that I see focus on stuff any compiler will optimize for you for free. In my experience the greatest source of gains for 'highly optimized code' lies completely elsewhere: memory access. On modern architectures the CPU is idle most of the time, waiting for memory to be served into its pipelines. Between L1 and L2 cache misses, TLB misses, NUMA cross-node access and even page faults that must fetch the page from disk, the memory access pattern of a modern application is the single most important thing one can optimize. I'm exaggerating slightly; of course there will be counterexample workloads that do not benefit from memory-access locality techniques. But most applications will. To be specific, what these techniques mean is simple: cluster your data in memory so that a single CPU can work on a tight memory range containing all it needs, with no expensive referencing of memory outside your cache lines or your current page. In practice this can mean something as simple as accessing an array by rows rather than by columns.
I would recommend you read the Alpha-Sort paper presented at the VLDB conference in 1995. This paper showed how cache-sensitive algorithms designed specifically for modern CPU architectures can blow the old benchmarks out of the water:
"We argue that modern architectures require algorithm designers to re-examine their use of the memory hierarchy. AlphaSort uses clustered data structures to get good cache locality..."
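To illustrate the "by rows rather than by columns" remark, here is a deliberately crude model in Python. It assumes a row-major array (as in C), 8 elements per cache line, and a toy cache that remembers only the most recently touched line; real caches hold many lines and prefetch, so the numbers show the access pattern and say nothing about actual timings.

```python
def cache_line_loads(rows, cols, order, line_elems=8):
    """Count how often a traversal moves to a different cache line.

    Toy model: row-major layout, and only the last touched line is cached.
    """
    if order == "by rows":
        indices = ((r, c) for r in range(rows) for c in range(cols))
    else:  # "by columns"
        indices = ((r, c) for c in range(cols) for r in range(rows))
    loads = 0
    current_line = None
    for r, c in indices:
        line = (r * cols + c) // line_elems   # which line holds element (r, c)
        if line != current_line:
            loads += 1
            current_line = line
    return loads

print("by rows:   ", cache_line_loads(64, 64, "by rows"))
print("by columns:", cache_line_loads(64, 64, "by columns"))
```

Walking along a row uses every element of a line before moving on, while walking down a column jumps a whole row ahead on each step. In a column-major language such as Fortran the roles of rows and columns are reversed.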

The general idea is that when you create your source tree in the compilation phase, before generating the code by parsing it, you do an additional step (optimization) where, based on certain heuristics, you collapse branches together, delete branches that aren't used or add extra nodes for temporary variables that are used multiple times.
Think of stuff like this piece of code:
a=(b+c)*3-(b+c)
which gets translated into
        -
      /   \
     *     +
    / \   / \
   +   3 b   c
  / \
 b   c
To a parser it would be obvious that the two + nodes, each with the same two descendants b and c, are identical, so they would be merged into a temp variable, t, and the tree would be rewritten:
      -
     / \
    *   t
   / \
  t   3
Now an even better parser would see that, since t is an integer, t*3 - t is the same as t*2, so the tree could be further simplified to:
    *
   / \
  t   2
and the intermediary code that you'd run your code generation step on would finally be
int t=b+c;
a=t*2;
with t marked as a register variable, which is exactly what would be written for assembly.
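The rewrite above can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions: the expression is a nested tuple of the form (operator, left, right), and the helper names `count_subtrees`, `replace`, `simplify` and `evaluate` are invented for this example. Real compilers use far more general machinery.

```python
from collections import Counter

def count_subtrees(node, counts):
    """Count how often each operator subtree occurs in the expression."""
    if isinstance(node, tuple):
        counts[node] += 1
        for child in node[1:]:
            count_subtrees(child, counts)

def replace(node, target, name):
    """Return a copy of node with every occurrence of target replaced by name."""
    if node == target:
        return name
    if isinstance(node, tuple):
        return (node[0],) + tuple(replace(ch, target, name) for ch in node[1:])
    return node

def simplify(node):
    """Rewrite t*k - t as t*(k-1) for a variable t and an integer constant k."""
    if isinstance(node, tuple):
        node = (node[0],) + tuple(simplify(ch) for ch in node[1:])
        op, left, right = node
        if (op == "-" and isinstance(right, str) and isinstance(left, tuple)
                and left[0] == "*" and left[1] == right
                and isinstance(left[2], int)):
            return ("*", right, left[2] - 1)
    return node

def evaluate(node, env):
    """Evaluate a tree, looking variable names up in env."""
    if isinstance(node, str):
        return env[node]
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = evaluate(left, env), evaluate(right, env)
    return {"+": a + b, "-": a - b, "*": a * b}[op]

# a = (b+c)*3 - (b+c)
expr = ("-", ("*", ("+", "b", "c"), 3), ("+", "b", "c"))

counts = Counter()
count_subtrees(expr, counts)
common = [sub for sub, n in counts.items() if n > 1][0]  # the repeated b+c

with_temp = replace(expr, common, "t")  # the tree with t substituted
final = simplify(with_temp)             # the tree after t*3 - t becomes t*2

env = {"b": 2, "c": 5}
t_value = evaluate(common, env)
print(with_temp, final, evaluate(final, {"t": t_value}))
```

Evaluating the original tree and the simplified one with the same b and c gives the same value, which is the property any such rewrite has to preserve.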
One final note: you can optimize for more than just run-time speed. You can also optimize for memory consumption, which often pulls in the opposite direction. Where unrolling loops and creating temporary copies would help speed up your code, they would also use more memory, so it's a trade-off that depends on what your goal is.

Here is an example of some optimization (fixing a poorly made decision) that I did recently. It's very basic, but I hope it illustrates that good gains can be made even from simple changes, and that 'optimization' isn't magic; it's just about making the best decisions to accomplish the task at hand.
In an application I was working on there were several LinkedList data structures that were being used to hold various instances of foo.
When the application was in use it was very frequently checking to see if a LinkedList contained object X. As the number of X's started to grow, I noticed that the application was performing more slowly than it should have been.
I ran a profiler, and realized that each 'myList.Contains(x)' call was O(N), because the list has to iterate through each item it contains until it reaches the end or finds a match. This was definitely not efficient.
So what did I do to optimize this code? I switched most of the LinkedList data structures to HashSets, which can do a '.Contains(X)' call in O(1) on average. Much better.
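The same effect is easy to reproduce in Python, which I'm using here only as a stand-in for the original code. A plain `list` plays the role of the linearly scanned container and a `set` plays the role of the HashSet; the sizes and names are arbitrary choices for this sketch.

```python
import timeit

N = 20_000
items_list = list(range(N))   # membership test scans element by element
items_set = set(items_list)   # membership test hashes the value
missing = -1                  # not present: worst case for the linear scan

list_secs = timeit.timeit(lambda: missing in items_list, number=200)
set_secs = timeit.timeit(lambda: missing in items_set, number=200)

print(f"list: {list_secs:.5f}s for 200 lookups")
print(f"set:  {set_secs:.5f}s for 200 lookups")
```

Both containers give the same answer to every lookup; only the cost per lookup changes. The trade-off is that a set does not keep insertion order or duplicates, so the switch is only valid where the code doesn't rely on either.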

This is a good question.
Usually the best practice is 1) just write the code to do what you need it to do, 2) then deal with performance, but only if it's an issue. If the program is "fast enough" it's not an issue.
If the program is not fast enough (like it makes you wait) then try some performance tuning. Performance tuning is not like programming. In programming, you think first and then do something. In performance tuning, thinking first is a mistake, because that is guessing.
Don't guess what to fix; diagnose what the program is doing.
Everybody knows that, but mostly they guess anyway.
It is natural to say "Could be the problem is X, Y, or Z" but only the novice acts on guesses. The pro says "but I'm probably wrong".
There are different ways to diagnose performance problems.
The simplest is just to single-step through the program at the assembly-language level, and don't take any shortcuts. That way, if the program is doing unnecessary things, then you are doing the same things, and it will become painfully obvious.
Another is to get a profiling tool, and as others say, measure, measure, measure.
Personally I don't care for measuring. I think it's a fuzzy microscope for the purpose of pinpointing performance problems. I prefer this method, and this is an example of its use.
Good luck.
ADDED: I think you will find, if you go through this exercise a few times, you will learn what coding practices tend to result in performance problems, and you will instinctively avoid them. (This is subtly different from "premature optimization", which is assuming at the beginning that you must be concerned about performance. In fact, you will probably learn, if you don't already know, that premature concern about performance can well cause the very problem it seeks to avoid.)

Optimizing a program means making it run faster.
The only way of making the program faster is making it do less:
find an algorithm that uses fewer operations (e.g. N log N instead of N^2)
avoid slow components of your machine (keep objects in cache instead of in main memory, or in main memory instead of on disk); reducing memory consumption nearly always helps!
Further rules:
In looking for optimization opportunities, adhere to the 80-20-rule: 20% of typical program code accounts for 80% of execution time.
Measure the time before and after every attempted optimization; often enough, optimizations don't.
Only optimize after the program runs correctly!
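As a rough sketch of those rules in practice, here is how one might check an attempted optimization in Python: confirm first that both versions agree, then time them on the same input. The duplicate-detection functions are hypothetical examples picked for this illustration, not anything from the question.

```python
import timeit

def has_duplicates_before(values):
    """Original version: compares every pair, O(N^2)."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False

def has_duplicates_after(values):
    """Attempted optimization: one pass with a set, O(N) on average."""
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False

sample = list(range(500))  # no duplicates, so neither version can exit early

# Correctness first: the two versions must agree before timing means anything.
assert has_duplicates_before(sample) == has_duplicates_after(sample)

# Take the best of a few repeats to reduce noise from other processes.
before = min(timeit.repeat(lambda: has_duplicates_before(sample), number=5, repeat=3))
after = min(timeit.repeat(lambda: has_duplicates_after(sample), number=5, repeat=3))

print(f"before: {before:.4f}s  after: {after:.4f}s")
```

If the "after" number is not clearly smaller on realistic input, the change is not an optimization and can be reverted.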
Also, there are ways to make a program appear to be faster:
separate GUI event processing from back-end tasks; prioritize user-visible changes over back-end calculation to keep the front-end "snappy"
give the user something to read while performing long operations (ever noticed the slideshows displayed by installers?)

However, as a self-taught, new programmer I've never really understood what exactly do people mean when talking about such things.
Let me share a secret with you: nobody does. There are certain areas where we know mathematically what is and isn't slow. But for the most part, performance is too complicated to be able to understand. If you speed up one part of your code, there's a good possibility you're slowing down another.
Therefore, when anyone tells you that one method is faster than another, there's a good possibility they're just guessing, unless one of three things is true:
They have data.
They're choosing an algorithm that they know is faster mathematically.
They're choosing a data structure that they know is faster mathematically.

Optimization means trying to improve computer programs for such things as speed. The question is very broad, because optimization can involve compilers improving programs for speed, or human beings doing the same.

I suggest you read a bit of theory first (from books, or Google for lecture slides):
Data structures and algorithms - what the O() notation is, how to calculate it,
what data structures and algorithms can be used to lower the O-complexity
Book: Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest
Compilers and assembly - how code is translated to machine instructions
Computer architecture - how the CPU, RAM, cache, branch prediction, out-of-order execution ... work
Operating systems - kernel mode, user mode, scheduling processes/threads, mutexes, semaphores, message queues
After reading a bit of each, you should have a basic grasp of all the different aspects of optimization.
Note: I wiki-ed this so people can add book recommendations.

I am going with the idea that optimizing code means getting the same results in less time. And "fully optimized" only means they ran out of ideas to make it faster. I throw large buckets of scorn on claims of "fully optimized" code! There's no such thing.
So you want to make your application/program/module run faster? The first thing to do (as mentioned earlier) is measure, also known as profiling. Do not guess where to optimize. You are not that smart and you will be wrong. My guesses are wrong all the time, and large portions of my year are spent profiling and optimizing. So get the computer to do it for you. On the PC, VTune is a great profiler. I think VS2008 has a built-in profiler, but I haven't looked into it. Otherwise, measure functions and large pieces of code with performance counters. You'll find sample code for using performance counters on MSDN.
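To show what "get the computer to do it for you" looks like, here is a small sketch using Python's standard `cProfile` and `pstats` modules. That is a swapped-in tool, not VTune or the Windows performance counters mentioned above, and the functions `slow_part`, `fast_part` and `program` are made-up stand-ins for real application code.

```python
import cProfile
import io
import pstats

def slow_part(n):
    """Deliberately wasteful: recomputes a small sum on every iteration."""
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total

def fast_part(n):
    """Cheap by comparison."""
    return n * 2

def program():
    return slow_part(20_000) + fast_part(20_000)

# Record where the time goes while the program runs.
profiler = cProfile.Profile()
profiler.enable()
program()
profiler.disable()

# Print the five entries with the largest cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists call counts and time per function, so the hot spot is read off the data rather than guessed.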
So where are your cycles going? You are probably waiting for data coming from main memory. Go read up on L1 & L2 caches. Understanding how the cache works is half the battle. Hint: Use tight, compact structures that will fit more into a cache-line.
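A quick way to see the "tight, compact structures" hint is to look at struct padding. The sketch below uses Python's `ctypes` to mimic C struct layout; the field names are invented, and the 64-byte cache line is an assumption (common on current x86-64 parts, but not universal).

```python
import ctypes

CACHE_LINE = 64  # assumed cache-line size in bytes; check your own hardware

class Padded(ctypes.Structure):
    # small, large, small: alignment rules insert padding around the int
    _fields_ = [
        ("flag", ctypes.c_char),
        ("count", ctypes.c_int),
        ("kind", ctypes.c_char),
    ]

class Compact(ctypes.Structure):
    # same fields, largest first: the two chars share one padded slot
    _fields_ = [
        ("count", ctypes.c_int),
        ("flag", ctypes.c_char),
        ("kind", ctypes.c_char),
    ]

for cls in (Padded, Compact):
    size = ctypes.sizeof(cls)
    print(f"{cls.__name__}: {size} bytes, {CACHE_LINE // size} per {CACHE_LINE}-byte line")
```

Same data, different field order: the compact layout fits more records into each cache line, so a loop over an array of them touches less memory.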
Optimization is lots of fun. And it's never ending too :)
A great book on optimization is Write Great Code: Understanding the Machine by Randall Hyde.

Make sure your application produces correct results before you start optimizing it.
