Amdahl's Law examples - parallel-processing

Amdahl's Law states that, for a computation in which a fraction S must be done sequentially, the maximum speedup obtained by going from a 1-processor system to an N-processor system is at most
1 / (S + [(1 - S) / N])
Does anyone know of books or notes where this analysis - determining the fraction S for the code of some non-trivial computation - is actually carried out?

There is a very good discussion of Amdahl's law in the Microsoft Patterns and Practices book on Parallel Programming with .NET.
Doing a detailed analysis of the code is going to be quite difficult - as every situation is unique.
However, it should be something that can be easily approximated, provided you have the mechanisms to determine the amount of concurrency. By changing the usable concurrency and profiling, you should be able to estimate S by solving the equation in reverse.
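As a concrete illustration (a minimal Python sketch with made-up numbers, not taken from any particular book): if profiling shows a given overall speedup on N workers, Amdahl's equation can be inverted to estimate S.

def estimate_serial_fraction(speedup, n):
    # Invert speedup = 1 / (S + (1 - S) / n) for S.
    return (n / speedup - 1.0) / (n - 1.0)

# Hypothetical measurement: a 3.2x speedup observed on 8 cores.
print(estimate_serial_fraction(3.2, 8))   # ~0.21, i.e. roughly 21% sequential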

Here are a few links that might help you out:
Microsoft Research | Discussion of Amdahl's Law in the Multicore Age
Evaluation and Analysis of Amdahl's (and Gustafson's) Laws
Abstract on Estimating Speed Multipliers | Discusses Amdahl's Law Briefly
Hope they help you out.

The principle involved isn't unique to parallelization. If 25% of the program's time is spent doing some particular operation, making everything but that 25% happen instantly (without affecting the 25%) would leave a program that took 25% of the original time, and was thus four times as fast.
In cases where an algorithm has clear phases that are or are not parallelizable, the application of the above formula would be simple--figure that N way parallelization will make the parts that are parallelizable run N times as fast, while the parts that aren't parallelizable will run at normal speed. In practice, I don't think most algorithms consist entirely of parts that are either 100% parallelizable or 100% sequential. In most interesting cases, algorithms can run largely in parallel, but have various ordering constraints; in some cases, the precise ordering constraints may be data-dependent. As such, the "percentage of parallelization" could be variable based upon the number of processors, among other things, and so trying to plug it into a formula would not be very helpful.
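For the simple case with clear phases, though, plugging numbers into the formula is straightforward. A minimal Python sketch, reusing the 25% sequential fraction from the example above; the speedup flattens out toward 1/S = 4 no matter how many processors are added:

def amdahl_speedup(s, n):
    # Maximum speedup with sequential fraction s on n processors.
    return 1.0 / (s + (1.0 - s) / n)

for n in (1, 2, 4, 16, 256, 65536):
    print(n, round(amdahl_speedup(0.25, n), 3))
# 1 -> 1.0, 2 -> 1.6, 4 -> 2.286, 16 -> 3.368, 256 -> 3.954, 65536 -> ~4.0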

Related

Why is algorithm time complexity often defined in terms of steps/operations?

I've been doing a lot of studying from many different resources on algorithm analysis lately, and one thing I'm currently confused about is why time complexity is often defined in terms of the number of steps/operations an algorithm performs.
For instance, in Introduction to Algorithms, 3rd Edition by Cormen, he states:
The running time of an algorithm on a particular input is the number of primitive operations or “steps” executed. It is convenient to define the notion of step so that it is as machine-independent as possible.
I've seen other resources define time complexity this way as well. I have a problem with this because, for one, it's called TIME complexity, not "step complexity" or "operations complexity." Secondly, while it's not a definitive source, an answer to a post here on Stack Overflow states "Running time is how long it takes a program to run. Time complexity is a description of the asymptotic behavior of running time as input size tends to infinity." Further, the Wikipedia page for time complexity states "In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm." Again, whether or not these are definitive sources, things make logical sense using these definitions.
When analyzing an algorithm and deriving its time complexity function, such as in Figure 1 below, you get an equation that is in units of time. It CAN represent the number of operations the algorithm performs, but only if the constant factors (C_1, C_2, C_3, etc.) are each equal to 1.
Figure 1
So with all that said, I'm just wondering how it's possible for this to be defined as the number of steps when that's not really what it represents. I'm trying to clear things up and make the connection between time and number of operations. I feel like there is a lot of information that hasn't been explicitly stated in the resources I've studied. Hoping someone can help clear things up for me, and without going into discussion about Big-O because that shouldn't be needed and misses the point of the question, in my opinion.
Thank you everyone for your time and help.
why time complexity is often defined in terms of the number of steps/operations an algorithm performs?
TL;DR: because that is how asymptotic analysis works; also, do not forget that time is a relative thing.
Longer story:
Measuring performance in time, as we humans understand time in daily use, doesn't make much sense: it is not a trivial task to do, and in a broader perspective it tells you little about the algorithm itself.
How would you measure the space and time your algorithm takes? What predefined unit of measurement would you apply to express its running time or space consumption?
You can measure it with a clock, or use some library/API to see exactly how many seconds/minutes/megabytes your algorithm took.
However, all of this will be VERY variable, because the time/space your algorithm takes will depend on:
Particular hardware you're using (architecture, CPU, RAM, etc.);
Particular programming language;
Operating System;
The compiler you used to translate your high-level code into a lower abstraction;
Other environment-specific details (sometimes even the temperature, as CPUs may scale their operating frequency dynamically).
Therefore, it is not a good idea to measure your complexity in precise timing (again, as we commonly understand time).
So, if you want to know the complexity (let's say time complexity) of your algorithm, why would it make sense to get a different answer for different machines, OSes, and so on? Algorithm complexity analysis is not about measuring runtime on a particular machine, but about having clear, mathematically defined bounds for the best, average, and worst cases.
I hope this makes sense.
Fine, so we finally get to the point that algorithm analysis should be done as a standalone, mathematical complexity analysis, one that does not care what the machine, OS, or system architecture is (apart from the algorithm itself), because we need to observe the logical abstraction without caring whether you're running it on Windows 10 with an Intel Core2Duo, Arch Linux with an Intel i7, or your mobile phone.
What's left?
The best way (so far) to analyze an algorithm is asymptotic analysis: an abstract analysis expressed in terms of the input, which counts the steps and operations performed by the algorithm in proportion to the input size.
This way you can speak about the algorithm per se, instead of depending on the surrounding circumstances.
Moreover, not only should we not care about machine or peripheral factors, we should also ignore lower-order terms and constant factors in the mathematical expression of the asymptotic analysis.
Constant Factors:
Constant factors correspond to instructions that are independent of the input data, i.e. their count does NOT depend on the input arguments.
A few reasons why you should ignore them:
Different programming languages, as well as their compiled output, will have a different number of constant operations/factors;
Different hardware will give different run-times for the same constant factors.
So you should stop analyzing constant factors and simply ignore them, focusing only on the input-related factors that matter; therefore:
O(2n) == O(5n) and all these are O(n);
6n^2 == 10n^2 and all these are n^2.
One more reason we don't care about constant factors is that we usually want to measure complexity for sufficiently large inputs: as the input grows toward infinity, it really makes no difference whether you have n or 2n.
Lower order terms:
A similar idea applies here:
Lower order terms, by definition, become increasingly irrelevant as you focus on large inputs.
When you have 5x^4 + 24x^2 + 5, you will never care much about the terms whose exponent is less than 4.
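A quick numerical sanity check of that claim (a small Python sketch): the full expression and its leading term converge as x grows, so the lower-order terms stop mattering:

full    = lambda x: 5*x**4 + 24*x**2 + 5
leading = lambda x: 5*x**4

for x in (10, 100, 1000):
    print(x, full(x) / leading(x))   # ratio tends to 1: ~1.048, ~1.0005, ~1.000005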
Time complexity is not about measuring how long an algorithm takes in seconds. It's about comparing different algorithms: how they will perform with a certain amount of input data, and how this performance develops as the input data gets bigger.
In this context, the "number of steps" is an abstract stand-in for time that can be compared independently of any hardware. I.e. you can't tell how long it will take to execute 1000 steps without exact specifications of your hardware (and how long one step takes), but you can always tell that executing 2000 steps will take about twice as long as executing 1000 steps.
And you can't really discuss time complexity without going into Big-O, because that's what it is.
You should note that algorithms are more abstract than programs. You examine two algorithms on paper or in a book and you want to analyze which works faster for input data of size N. So you must analyze them with logic and proofs. You can also run them on a computer and measure the time, but that's not a proof.
Moreover, different computers execute programs at different speeds, depending on CPU speed, RAM, and many other conditions. Even a program on a single computer may run at different speeds depending on the resources available at the time.
So, time for algorithms must be independent of how long a single atomic instruction takes to execute on a specific computer; it's considered just one step, or O(1). Also, we aren't interested in constants. For example, it doesn't matter whether a program has two or 10 instructions; both will run in a fraction of a microsecond. Usually the number of such fixed instructions is limited and they all run fast on computers. What matters are the instructions or loops whose execution count depends on a variable, which could be the size of the input to the program.
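As a small illustration of counting steps rather than seconds, here is a hypothetical instrumented loop in Python; the count depends only on the input size n, never on the machine it happens to run on:

def summed_with_step_count(xs):
    total, steps = 0, 0
    for x in xs:        # the body runs once per input element,
        total += x
        steps += 1      # so the step count grows linearly with n
    return total, steps

for n in (1000, 2000, 4000):
    _, steps = summed_with_step_count(range(n))
    print(n, steps)     # doubling n doubles the steps, on any hardware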

What is the difference between Work, Span and Time in parallel algorithm analysis?

When analysing parallel algorithms, we tend to focus on Work (T1), Span (T∞), or time.
What I'm confused about is: if I were given an algorithm to analyse, what key hints would I need to look for to determine the Work, Span, and time?
Suppose this algorithm:
How do I analyse the above algorithm to find the Work, Time and span?
Origin:
PRAMs were introduced in the early 1970s, in the hope that they would allow a leap ahead in performance for tackling computationally hard problems. Yet the promises, or rather the expectations, were cooled down by the principal limitations these computing-device architectures still have to live with.
Theory:
T1 = the amount of time the processing takes, measured when execution is done in a pure-[SERIAL] schedule.
T∞ = the amount of time it takes to process the computing graph ( a Directed, and, as is often forgotten, Acyclic Graph ) when execution is done in a "just"-[CONCURRENT] manner, but with an indeed infinite amount of real resources available the whole time, thus allowing any degree of parallelism to actually, if only incidentally, take place.
( WARNING NOTICE: your professors need not enjoy this interpretation, but reality rules -- infinite processors are simply not enough, as every other resource must also be present in infinite amounts and capabilities, be it RAM accesses, IOs, sensorics et al, so that all of these must provide infinitely parallel services, avoiding any kind of blocking / waiting / re-scheduling that might otherwise appear due to any resource's temporal / contextual inability to serve as asked, under an infinitely parallel amount of such service requests, and answer "immediately" ).
How to tackle:
T1 for the problem posted above consists of two imperatively ordered O(N) blocks - the memory allocation of M[:] and the final search for Max over M[:] - and two O(N^2) blocks processing "all pairs" (i, j) over a domain of N-by-N values.
Based on an assumption of CIS/RIS homogeneity, this Work will be no less than ~ 2N(1+N).
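Since the algorithm itself is not reproduced above, the following Python snippet is only a hypothetical reconstruction of that count, charging one unit per primitive step under the CIS/RIS homogeneity assumption: two O(N) blocks (setting up M[:] and the final Max search) plus two O(N^2) all-pairs passes give exactly 2N(1+N).

def serial_work(n):
    work = 0
    work += n          # allocate / initialise M[:]
    work += 2 * n * n  # two "all pairs" (i, j) passes over the N-by-N domain
    work += n          # final search for Max over M[:]
    return work

for n in (2, 3, 10):
    print(n, serial_work(n), 2 * n * (1 + n))   # the two columns match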
For T∞ there is more to do. First, detect which parallel code-execution paths may happen; next, also protect the results from being "overwritten" at colliding moments - your headline only briefly mentions CRCW, a weak assumption under which to analyse the latter problem for a Concurrent-Read-Concurrent-Write PRAM machine.
Do not hesitate to take pencil and paper and draw the DAG for the smallest possible N == 2 (or 3, if you have a slightly larger sheet of paper), from which one can derive the flow of operations (and the potential (in)dependency ordering of operations in the case of the less forgiving but more realistic CREW or EREW types of PRAM computing devices).
Criticism: the Devil's part of the lesson, which your professors will like the least
Any careful, kind reader has already noted several nontrivial assumptions, the homogeneity of CIS/RIS instruction durations being a minor one of these.
The biggest, yet hidden, part of the problem is the actual cost of process scheduling. A pure-[SERIAL] code execution enjoys the (unfair) advantage of having zero add-on overhead costs (plus, on many silicon architectures, there are additional performance tricks arising from out-of-order instruction execution; see superscalar pipelined or VLIW architectures for in-depth details), while any sort of real-world process scheduling adds overhead costs that were not present in the pure-[SERIAL] code-execution case used to obtain T1.
On real-world systems, where both NUMA effects and non-homogeneous CIS/RIS instruction durations introduce remarkable irregularities into code-execution flow durations, these add-on overhead costs dramatically shift the baseline of any speedup comparison.
The Devil's Question:
Where do we account for these real-world add-on costs?
In real life we do.
In the original Amdahl's Law formulation and in Brent's Theorem, we did not.
The re-formulated Amdahl's Law also takes these { initial | coordination | terminal } add-on overhead costs as inputs, and suddenly the computed and experimentally validated speedups start to match the observations made on commonly operated real-world computing fabrics.

Working through an example of Amdahl's Law with respect to percentage speedup

I am reading through "Computer Architecture: A Quantitative Approach, 5th ed." and am trying to grasp Amdahl's Law when it comes to speeding up portions of the system, i.e. speeding up a certain percentage of the system by a certain percentage. It is easy to understand when you are talking about speeding up the whole system by a certain factor, e.g. a system that is 10 times faster.
To give a concrete example:
You have a system, where a certain sub-system accounts for 70% of the execution time and you wish to develop a speedup which will improve the latency of this sub-system by 50%.
From the book, Amdahl's Law is listed as:
SpeedupOverall = 1/((1-FractionEnhanced)+(FractionEnhanced/SpeedupEnhanced))
I gather from the explanation of the Fractional Enhancement ("The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement"), that: FractionEnhanced = 70% or 0.7.
My question here is how to reflect the speedup. The book describes it as "The improvement gained by the enhanced execution mode, that is, how much faster the task would run if the enhanced mode were used for the entire program". The book says that this would be the time of the original mode over the improved time; in this case 70/50, or 1.4. However, where my confusion comes in is with this website, where, by examining the Java applet code, it seems that the enhanced speedup would be 1 + the percentage speedup, or 1.5. Maybe I am overthinking this, but I am also thinking it could be 0.7/(0.7 - 0.7*0.5), or 2 (since 70%*50% is the actual latency reduction in terms of the actual sub-system, right?).
Working the math out, we get the following answers:
For SpeedupEnhanced = 70/50 = 1.4: SpeedupOverall = 1/((1-0.7)+ .7/1.4) = 1.25
For SpeedupEnhanced = 1+0.5 = 1.5: SpeedupOverall = 1/((1-0.7)+ .7/1.5) = 1.3043
For SpeedupEnhanced = 0.7/(0.7-0.7*0.5) = 2: SpeedupOverall = 1/((1-0.7)+.7/2) = 1.54
Which one would be the correct speedup here? The second seems to make sense to me, but the book seems to imply that the first is correct. Any help by way of references or explanations as to how to grasp this type of speedup would be greatly appreciated.
The third answer (a 1.54x total speedup) is the correct one, because "speedup enhanced" is a dimensionless value characterizing only the change in the enhanced part's execution time (in "x" units). Note that the enhanced speedup is simply equal to 70/(70*0.5) = 2 in your case.
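A quick check of that arithmetic (a small Python sketch, assuming an arbitrary 100 time units for the original run):

total_before  = 100.0
enhanced      = 0.7 * total_before        # 70 units are affected by the enhancement
rest          = total_before - enhanced   # 30 units are untouched

enhanced_after   = enhanced * 0.5             # a 50% latency improvement -> 35 units
speedup_enhanced = enhanced / enhanced_after  # 70 / 35 = 2.0

overall = total_before / (rest + enhanced_after)    # 100 / 65 ~ 1.54
amdahl  = 1 / ((1 - 0.7) + 0.7 / speedup_enhanced)  # same ~1.54
print(speedup_enhanced, overall, amdahl)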
Overall there are lots of possible confusions regarding the interpretation of Amdahl's Law and Gustafson's Law. Surprisingly, a good starting point is the Wikipedia page on Amdahl's Law. In particular, it will map you back to the more traditional interpretation, which puts the emphasis on parallel computation and the number of processors instead of the "enhanced speedup" notion. Deeper and more exhaustive reading can be found on the reference page of professor Gustafson, who "invented" the alternate, "optimistic" law. After studying all the materials, you will find that the notion of "speedup" is even more interesting and ambiguous than it first appears.

What can be parameters other than time and space while analyzing certain algorithms?

I am interested in parameters other than space and time when analysing the effectiveness of an algorithm. For example, we can focus on the effectiveness of the trap function when developing encryption algorithms. What other things can you think of?
First and foremost there's correctness. Make sure your algorithm always works, no matter what the input. Even for input that the algorithm is not designed to handle, you should print an error message, not crash the entire application. If you use greedy algorithms, make sure they truly work in every case, not just the few cases you tried by hand.
Then there's practical efficiency. An O(N^2) algorithm can be a lot faster than an O(N) algorithm in practice. Do actual tests and don't rely on theoretical results too much.
Then there's ease of implementation. You usually don't need the best intro sort implementation to sort an array of 100 integers once, so don't bother.
Look for worst cases in your algorithms and if possible, try to avoid them. If you have a generally fast algorithm but with a very bad worst case, consider detecting that worst case and solving it using another algorithm that is generally slower but better for that single case.
Consider space and time tradeoffs. If you can afford the memory in order to get better speeds, there's probably no reason not to do it, especially if you really need the speed. If you can't afford the memory but can afford to be slower, do that.
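One everyday form of that tradeoff is memoization: spend memory on a cache of already-computed results so that repeated work becomes a lookup. A minimal Python sketch:

from functools import lru_cache

@lru_cache(maxsize=None)   # unbounded cache: buys speed with memory
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(200))            # instant with the cache, hopeless without it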
If you can, use existing libraries. Don't roll your own multiprecision library if you can use GMP for example. For C++, stuff like boost and even the STL containers and algorithms have been worked on for years by an army of people and are most likely better than you can do alone.
Stability (sorting) - Does the algorithm maintain the relative order of equal elements? (see the small example after this list)
Numeric Stability - Is the algorithm prone to error when very large or small real numbers are used?
Correctness - Does the algorithm always give the correct answer? If not, what is the margin of error?
Generality - Does the algorithm work in many situations (e.g. with many different data types)?
Compactness - Is the program for the algorithm concise?
Parallelizability - How well does performance scale when the number of concurrent threads of execution is increased?
Cache Awareness - Is the algorithm designed to maximize use of the computer's cache?
Cache Obliviousness - Is the algorithm tuned for particular cache sizes / cache-line sizes, or does it perform well regardless of the parameters of the cache?
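To illustrate the stability point from the list above, a tiny Python example (Python's built-in sorted() happens to be stable): records that compare equal keep their original relative order.

records = [("pear", 1), ("apple", 1), ("apple", 3), ("pear", 2)]
by_count = sorted(records, key=lambda r: r[1])   # stable sort on the second field
print(by_count)   # [('pear', 1), ('apple', 1), ('pear', 2), ('apple', 3)]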
Complexity. 2 algorithms being the same in all other respects, the one that's much simpler is going to be a much better candidate for future customization and use.
Ease of parallelization. Depending on your use case, it might not make any difference or, on the other hand, make the algorithm useless because it can't use 10000 cores.
Stability - some algorithms may "blow up" with certain test conditions, e.g. take an inordinately long time to execute, or use an inordinately large amount of memory, or perhaps not even terminate.
For algorithms that perform floating point operations, the accumulation of round-off error is often a consideration.
Power consumption, for embedded algorithms (think smartcards).
One important parameter that is frequently measured in the analysis of algorithms is cache hits and cache misses. While this is a very implementation- and architecture-dependent issue, it is possible to generalise somewhat. One particularly interesting property is being cache-oblivious, which means that the algorithm will use the cache optimally on multiple machines with different cache sizes and structures without modification.
Time and space are the big ones, and they seem so plain and definitive, yet they should often be qualified (1). The fact that the OP uses the word "parameter" rather than, say, "criteria" or "properties" is somewhat indicative of this (as if a big-O value on time and on space were sufficient to frame the underlying algorithm).
Other criteria include:
domain of applicability
complexity
mathematical tractability
definitiveness of outcome
ease of tuning (may be tied to the "complexity" and "mathematical tractability" mentioned above)
ability of running the algorithm in a parallel fashion
(1) "qualified": As hinted in other answers, a -technically- O(n^2) algorithm may be found to be faster than say an O(n) algorithm, in 90% of the cases (which, btw, may turn out to be 100% of the practical cases)
Worst case and best case are also interesting, especially when linked to some conditions on the input. If your input data has certain properties, an algorithm that takes advantage of those properties may perform better than another algorithm which performs the same task but does not use them.
For example, many sorting algorithms perform very efficiently when the input is partially ordered in a specific way that minimizes the number of operations the algorithm has to execute.
(if your input is mostly sorted, an insertion sort will fit nicely, while you would never use that algorithm otherwise)
If we're talking about algorithms in general, then (in the real world) you might have to think about CPU/filesystem(read/write operations)/bandwidth usage.
True, they are way down the list of things you need to worry about these days, but given a massive enough volume of data and cheap enough infrastructure you might have to tweak your code to ease up on one or the other.
What you are interested in aren't parameters; rather, they are intrinsic properties of an algorithm.
Anyway, another property you might be interested in, and analyse an algorithm for, concerns heuristics (or rather, approximation algorithms), i.e. algorithms which don’t find an exact solution but rather one that is (hopefully) good enough.
You can analyze how far a solution is from the theoretical optimal solution in the worst case. For example, an existing algorithm (forgot which one) approximates the optimal travelling salesman tour by a factor of two, i.e. in the worst case it’s twice as long as the optimal tour.
Another metric concerns randomized algorithms, where randomization is used to prevent unwanted worst-case behaviour. One example is randomized quicksort; quicksort has a worst-case running time of O(n^2), which we want to avoid. By shuffling the array beforehand we can avoid the worst case (i.e. an already sorted array) with very high probability. Just how high that probability is can be important to know; this is another intrinsic property of the algorithm that can be analyzed using stochastic methods.
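A minimal sketch of that randomization step (plain Python, not an in-place implementation): shuffle first, then run an otherwise deterministic quicksort, so no particular input, such as an already sorted array, can reliably trigger the O(n^2) case.

import random

def randomized_quicksort(values):
    items = list(values)
    random.shuffle(items)     # the randomization that defuses the worst case
    return _quicksort(items)

def _quicksort(items):
    if len(items) <= 1:
        return items
    pivot = items[0]          # a plain first-element pivot is fine after shuffling
    smaller = [x for x in items[1:] if x < pivot]
    larger  = [x for x in items[1:] if x >= pivot]
    return _quicksort(smaller) + [pivot] + _quicksort(larger)

print(randomized_quicksort([1, 2, 3, 4, 5, 6, 7, 8]))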
For numeric algorithms, there's also the property of continuity: that is, whether if you change input slightly, output also changes only slightly. See also Continuity analysis of programs on Lambda The Ultimate for a discussion and a link to an academical paper.
For lazy languages, there's also strictness: f is called strict if f _|_ = _|_ (where _|_ denotes the bottom (in the sense of domain theory), a computation that can't produce a result due to non-termination, errors etc.), otherwise it is non-strict. For example, the function \x -> 5 is non-strict, because (\x -> 5) _|_ = 5, whereas \x -> x + 1 is strict.
Another property is determinism: whether the result of the algorithm (or its other properties, such as running time or space consumption) depends solely on its input.
All these things in the other answers about the quality of various algorithms are important and should be considered.
But time and space are two things that vary at some rate compared to the size of the input (n). So what else can vary according to n?
There are several that are related to I/O. For example, the number of writes to a disk is an important one, which may not be directly shown by space and time estimates alone. This becomes particularly important with flash memory, where the number of writes to the same memory location is the significant metric in some algorithms.
Another I/O metric would be "chattiness". A networking protocol might send shorter messages more often, adding up to the same space and time as another networking protocol, but some aspect of the system (perhaps billing?) might make minimizing either the size or the number of the messages desirable.
And that brings us to Cost, which is a very important algorithmic consideration sometimes. The cost of an algorithm may be affected by both space and time in different amounts (consider the separate costing of server storage space and gigabits of data transfer), but the cost is the thing that you wish to minimize overall, so it may have its own big-O estimations.

Performance Testing for Calculation-Heavy Programs

What are some good tips and/or techniques for optimizing and improving the performance of calculation-heavy programs? I'm talking about things like complicated graphics calculations or mathematical and simulation types of programming, where every second saved is useful, as opposed to IO-heavy programs where only a certain amount of speedup is helpful.
While changing the algorithm is frequently mentioned as the most effective method here, I'm trying to find out how effective different algorithms are in the first place, so I want to make each algorithm as efficient as possible. The "problem" I'm solving isn't something that's well known, so there are few if any algorithms on the web, but I'm looking for any good advice on how to proceed and what to look for.
I am exploring the differences in effectiveness between evolutionary algorithms and more straightforward approaches for a particular group of related problems. I have written three evolutionary algorithms for the problem already and now I have written a brute-force technique that I am trying to make as fast as possible.
Edit: To specify a bit more. I am using C# and my algorithms all revolve around calculating and solving constraint type problems for expressions (using expression trees). By expressions I mean things like x^2 + 4 or anything else like that which would be parsed into an expression tree. My algorithms all create and manipulate these trees to try to find better approximations. But I wanted to put the question out there in a general way in case it would help anyone else.
I am trying to find out if it is possible to write a useful evolutionary algorithm for finding expressions that are a good approximation for various properties. Both because I want to know what a good approximation would be and to see how the evolutionary stuff compares to traditional methods.
It's pretty much the same process as any other optimization: profile, experiment, benchmark, repeat.
First you have to figure out what sections of your code are taking up the time. Then try different methods to speed them up (trying methods based on merit would be a better idea than trying things at random). Benchmark to find out if you actually did speed them up. If you did, replace the old method with the new one. Profile again.
I would recommend against a brute force approach if it's at all possible to do it some other way. But, here are some guidelines that should help you speed your code up either way.
There are many, many different optimizations you could apply to your code, but before you do anything, you should profile to figure out where the bottleneck is. Here are some profilers that should give you a good idea about where the hot spots are in your code:
GProf
PerfMon2
OProfile
HPCToolkit
These all use sampling to get their data, so the overhead of running them with your code should be minimal. Only GProf requires that you recompile your code. Also, the last three let you do both time and hardware performance counter profiles, so once you do a time (or CPU cycle) profile, you can zoom in on the hotter regions and find out why they might be running slow (cache misses, FP instruction counts, etc.).
Beyond that, it's a matter of thinking about how best to restructure your code, and this depends on what the problem is. It may be that you've just got a loop that the compiler doesn't optimize well, and you can inline or move things in/out of the loop to help the compiler out. Or, if you're running as fast as you can with basic arithmetic ops, you may want to try to exploit vector instructions (SSE, etc.) If your code is parallel, you might have load balance problems, and you may need to restructure your code so that data is better distributed across cores.
These are just a few examples. Performance optimization is complex, and it might not help you nearly enough if you're doing a brute force approach to begin with.
For more information on ways people have optimized things, there were some pretty good examples in the recent Why do you program in assembly? question.
If your optimization problem is (quasi-)convex or can be transformed into such a form, there are far more efficient algorithms than evolutionary search.
If you have large matrices, pay attention to your linear algebra routines. The right algorithm can shave an order of magnitude off the computation time, especially if your matrices are sparse.
Think about how data is loaded into memory. Even when you think you're spending most of your time on pure arithmetic, you're actually spending a lot of time moving things between levels of cache etc. Do as much as you can with the data while it's in the fastest memory.
Try to avoid unnecessary memory allocation and de-allocation. Here's where it can make sense to back away from a purely OO approach.
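A small sketch of the allocation point, using NumPy purely as a convenient stand-in for whatever numeric code you actually have: compute into preallocated buffers instead of creating fresh temporaries on every pass through the loop.

import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Naive: every iteration allocates new temporary arrays for a * b and the sum.
for _ in range(100):
    c = a * b + a

# Reusing buffers: allocate once, write results into them on every pass.
tmp = np.empty_like(a)
out = np.empty_like(a)
for _ in range(100):
    np.multiply(a, b, out=tmp)
    np.add(tmp, a, out=out)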
This is more of a tip to find holes in the algorithm itself...
To realize maximum performance, simplify everything inside the most inner loop at the expense of everything else.
One example of keeping things simple is the classic bouncing ball animation. You can implement gravity by looking up the definition in your physics book and plugging in the numbers, or you can do it like this and save precious clock cycles:
// initialize:
float y  = 0;      // y coordinate (distance fallen)
float yi = 0;      // incremental variable (per-frame velocity)

// loop (once per animation frame):
while (true) {
    y  += yi;      // move the ball by the current increment
    yi += 0.001;   // crude gravity: constant acceleration each frame
    if (y > 10)    // hit the floor at y == 10...
        yi = -yi;  // ...so reverse direction (perfectly elastic bounce)
}
But now let's say you're having to do this with nested loops in an N-body simulation where every particle is attracted to every other particle. This can be an enormously processor intensive task when you're dealing with thousands of particles.
You should of course take the same approach as to simplify everything inside the most inner loop. But more than that, at the very simplest level you should also use data types wisely. For example, math operations are faster when working with integers than floating point variables. Also, addition is faster than multiplication, and multiplication is faster than division.
So with all of that in mind, you should be able to simplify the most inner loop using primarily addition and multiplication of integers. And then any scaling down you might need to do can be done afterwards. To take the y and yi example, if yi is an integer that you modify inside the inner loop then you could scale it down after the loop like this:
y += yi * 0.01;
These are very basic low-level performance tips, but they're all things I try to keep in mind whenever I'm working with processor intensive algorithms. Of course, if you then take these ideas and apply them to parallel processing on a GPU then you can take your algorithm to a whole new level. =)
How you do this depends mostly on which language you are using. Still, the key in any language is the profiler. Profile your code. See which functions/operations are taking the most time and then determine if you can make these costly operations more efficient.
Standard bottlenecks in numerical algorithms are memory usage (do you access matrices in the order in which the elements are stored in memory?), communication overhead, etc. They can be a little different from those of non-numerical programs.
Moreover, many other factors, such as preconditioning, can lead to drastically different performance behavior of the SAME algorithm on the same problem. Make sure you determine optimal parameters for your implementations.
As for comparing different algorithms, I recommend reading the paper "Benchmarking optimization software with performance profiles," Jorge Moré and Elizabeth D. Dolan, Mathematical Programming 91 (2002), 201-213. It provides a nice, uniform way to compare different algorithms applied to the same problem set. It really should be better known outside of the optimization community (in my not so humble opinion, at least).
Good luck!
