Evaluating a function at a particular value in parallel - algorithm

The question may seem vague, but let me explain it.
Suppose we have a function f(x,y,z ....) and we need to find its value at the point (x1,y1,z1 .....).
The most trivial approach is to just replace (x,y,z ...) with (x1,y1,z1 .....).
Now suppose that the function takes a long time to evaluate and I want to parallelize the evaluation. Obviously this will depend on the nature of the function, too.
So my question is: what are the constraints that I have to look for while "thinking" to parallelize f(x,y,z...)?
If possible, please share links to study.

Asking the question in such a general way does not permit very specific advice to be given.
I'd begin the analysis by looking for ways to evaluate or rewrite the function using groups of variables that interact closely, creating intermediate expressions that can be used to make the final evaluation. You may find a way to do this involving a hierarchy of subexpressions that leads from the variables themselves to the final function.
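To make that concrete, here is a minimal C# sketch (the function f and its grouping into subexpressions are hypothetical, purely for illustration): the two parenthesized subexpressions share no intermediate results, so they can be evaluated concurrently before the final combination.

using System;
using System.Threading.Tasks;

class SubexpressionDemo
{
    // Hypothetical f(x,y,z,w) = (x*y + sin(z)) * (z*w + cos(x)):
    // the two parenthesized subexpressions are independent,
    // so they can be evaluated in parallel.
    static double F(double x, double y, double z, double w)
    {
        var left  = Task.Run(() => x * y + Math.Sin(z));
        var right = Task.Run(() => z * w + Math.Cos(x));
        return left.Result * right.Result;   // final combination
    }

    static void Main() => Console.WriteLine(F(1.0, 2.0, 3.0, 4.0));
}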
In general the shorter and wider such an evaluation tree is, the greater the degree of parallelism. There are two cautionary notes to keep in mind that detract from "more parallelism is better."
For one thing, a highly parallel approach may actually involve more total computation than your original "serial" approach. In fact some loss of efficiency in this regard is to be expected, since a serial approach can take advantage of all prior subexpression evaluations and maximize their reuse.
For another thing, the parallel evaluation will often have worse rounding/accuracy behavior than a serial evaluation chosen to give good or optimal error estimates.
A lot of work has been done on evaluations that involve matrices, where there is usually a lot of symmetry to how the function value depends on its arguments. So it helps to be familiar with numerical linear algebra and parallel algorithms that have been developed there.
Another area where a lot is known is for multivariate polynomial and rational functions.
When the function is transcendental, one might hope for some transformations or refactoring that makes the dependence more tractable (algebraic).
Not directly relevant to your question are algorithms that amortize the cost of computing function values across a number of arguments. For example in computing solutions to ordinary differential equations, there may be "multi-step" methods that share the cost of evaluating derivatives at intermediate points by reusing those values several times.
I'd suggest that your concern to speed up the evaluation of the function suggests that you plan to perform more than one evaluation. So you might think about ways to take advantage of prior evaluations or perform evaluations at related arguments in a way that contributes to your search for parallelism.
Added: Some links and discussion of search strategy
Most authors use the phrase "parallel function evaluation" to mean evaluating the same function at multiple argument points. See for example:
[Coarse Grained Parallel Function Evaluation -- Rulon and Youssef]
http://cdsweb.cern.ch/record/401028/files/p837.pdf
A search strategy to find the kind of material Gaurav Kalra asks about should try to avoid those. For example, we might include "fine-grained" in our search terms.
It's also effective to focus on specific kinds of functions, e.g. "polynomial evaluation" rather than "function evaluation". Here for example we have a treatment of some well-known techniques for "fast" evaluations applied to design for GPU-based computation:
[How to obtain efficient GPU kernels -- Cruz, Layton, and Barba]
http://arxiv.org/PS_cache/arxiv/pdf/1009/1009.3457v1.pdf
(from their Abstract) "Here, we have tackled fast summation algorithms (fast multipole method and fast Gauss transform), and applied algorithmic redesign for attaining performance on GPUs. The progression of performance improvements attained illustrates the exercise of formulating algorithms for the massively parallel architecture of the GPU."
Another search term that might be worth excluding is "pipelined". This term invariably discusses the sort of parallelism that can be used when multiple function evaluations are to be done: early stages of the computation can be done in parallel with later stages, but on different inputs. So that's a search term that one might want to exclude. Or not.
Here's a paper that discusses n-fold speedup for n-variate polynomial evaluation over finite fields GF(p). This might be of direct interest for cryptographic applications, but the approach via a modified Horner's method may be interesting for its potential for generalization:
[Comparison of Bit and Word Level Algorithms for Evaluating Unstructured Functions over Finite Rings -- Sunar and Cyganski]
http://www.iacr.org/archive/ches2005/018.pdf
"We present a modification to Horner’s algorithm for evaluating
arbitrary n-variate functions defined over finite rings and fields.
... If the domain is a finite field GF(p) the complexity of
multivariate Horner polynomial evaluation is improved from O(p^n)
to O((p^n)/(2n)). We prove the optimality of the presented algorithm."
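For orientation, here is the plain univariate Horner scheme that the paper modifies (a minimal sketch only; the paper's actual contribution, the n-variate finite-ring version, is not shown here):

using System;

class HornerDemo
{
    // Baseline univariate Horner scheme for
    // p(x) = c[0] + c[1]*x + ... + c[d]*x^d.
    static double Horner(double[] c, double x)
    {
        double acc = 0;
        for (int i = c.Length - 1; i >= 0; i--)
            acc = acc * x + c[i];
        return acc;
    }

    static void Main() =>
        Console.WriteLine(Horner(new[] { 1.0, 0.0, 2.0 }, 3.0)); // 1 + 2*9 = 19
}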
Multivariate rational functions can be considered simply as the ratio of two such polynomial functions. The special case of univariate rational functions, which can be particularly effective in approximating elementary transcendental functions among others, can be evaluated via finite (i.e. truncated) continued fractions, whose convergents (partial numerators and denominators) can be defined recursively.
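Concretely, the convergents satisfy the classical three-term recurrences, sketched serially below; each step amounts to a 2x2 matrix product, which is the structure the parallel evaluation in the final link exploits.

using System;

class ConvergentsDemo
{
    // n-th convergent p_n/q_n of b0 + a1/(b1 + a2/(b2 + ...)), via
    //   p_k = b_k*p_{k-1} + a_k*p_{k-2},  q_k = b_k*q_{k-1} + a_k*q_{k-2}.
    static double Convergent(double[] a, double[] b, int n)
    {
        double pPrev = 1, qPrev = 0;   // p_{-1}, q_{-1}
        double p = b[0], q = 1;        // p_0, q_0
        for (int k = 1; k <= n; k++)
        {
            double pNext = b[k] * p + a[k] * pPrev;
            double qNext = b[k] * q + a[k] * qPrev;
            pPrev = p; p = pNext;
            qPrev = q; q = qNext;
        }
        return p / q;
    }

    static void Main()
    {
        // sqrt(2) = 1 + 1/(2 + 1/(2 + ...)): a_k = 1, b_0 = 1, b_k = 2
        var a = new[] { 0.0, 1, 1, 1, 1, 1 };
        var b = new[] { 1.0, 2, 2, 2, 2, 2 };
        Console.WriteLine(Convergent(a, b, 5)); // 99/70 ~ 1.41429
    }
}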
The topic of continued fraction evaluations allows us to segue to a final link that connects that topic with some familiar parallelism of numerical linear algebra:
[LU Factorization and Parallel Evaluation of Continued Fractions -- Ömer Egecioglu]
http://www.cs.ucsb.edu/~omer/DOWNLOADABLE/lu-cf98.pdf
"The first n convergents of a general continued fraction (CF) can be computed optimally in logarithmic parallel time using O(n/log(n)) processors."

You've asked how to speed up the evaluation of a single call to a single function. Unless that evaluation time is measured in hours, it isn't clear why it is worth the bother to speed it up. If you insist on speeding up the function execution itself, you'll have to inspect its content to see if some aspects of it are parallelizable. You haven't provided any information on what it computes or how it does so, so it is hard to give any further advice on this aspect. hardmath's answer suggests some ideas you can use, depending on the actual internal structure of your function.
However, usually people asking your question actually call the function many times (say, N times) for different values of x,y,z (eg., x1,y1,... x2,y2,... xN,yN, ... using your vocabulary).
Yes, if you speed up the execution of the function, making the collective set of calls will speed up, and that's what people tend to want. If this is the case, it is "technically easy" to speed up overall execution: make the N calls to the function in parallel. Then all the pointwise evaluations happen at the same time. To make this work, you pretty much have to make vectors out of the values you want to process (this kind of trick is called "data parallel" programming). So what you really want is something like:
PARALLEL DO I=1,N
    RESULT(I)=F(X(I),Y(I), ...)
END PARALLEL DO
How you implement PARALLEL DO depends on the programming language and libraries you have.
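In C#, for instance, PARALLEL DO is Parallel.For; a minimal sketch with a stand-in for an expensive f:

using System;
using System.Threading.Tasks;

class DataParallelDemo
{
    static double F(double x, double y) => Math.Sin(x) * Math.Cos(y); // stand-in for an expensive f

    static void Main()
    {
        const int N = 1_000_000;
        var x = new double[N]; var y = new double[N]; var result = new double[N];
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2 * i; }

        // the PARALLEL DO: each pointwise evaluation runs independently
        Parallel.For(0, N, i => result[i] = F(x[i], y[i]));

        Console.WriteLine(result[42]);
    }
}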
This generally only works if N is a fairly big number, but the more expensive f is to execute, the smaller the N at which parallelism starts to pay off.
You can also take advantage of the structure of your function to make this even more efficient. If f computes some internal value the same way for commonly used cases, you might be able to break out the special cases, pre-compute those, and then use those results to compute "the rest of f" for each individual call.
If you are combining ("reducing") the results of all the function calls (e.g., summing all the results), you should do that outside the PARALLEL DO loop. If you try to combine results inside the loop, you'll have "loop carried dependencies" and you'll either get the wrong answer or it won't go parallel in the way you expect, depending on your compiler or the parallelism libraries. You can combine the answers efficiently if the combination is some associative/commutative operation such as "sum", by building what amounts to a binary tree and running the evaluation of that in parallel. That's a different problem that also occurs frequently in data parallel computation, but which we won't go into further here.
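In C#, PLINQ's Sum is one packaged form of that associative combination: each partition is reduced independently and the partial sums are combined, with no loop-carried dependency in your code. A sketch, again with a stand-in f:

using System;
using System.Linq;

class ParallelReduceDemo
{
    static double F(double x, double y) => Math.Sin(x) * Math.Cos(y);

    static void Main()
    {
        const int N = 1_000_000;
        var x = new double[N]; var y = new double[N];
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2 * i; }

        // Sum is associative, so PLINQ reduces each partition
        // independently and then combines the partial sums.
        double total = ParallelEnumerable.Range(0, N)
            .Select(i => F(x[i], y[i]))
            .Sum();

        Console.WriteLine(total);
    }
}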
Often the overhead of a parallel for loop is pretty high (forking threads is expensive). So usually people divide the overhead across several iterations:
PARALLEL DO I=1,N,M
    DO J=I,I+M-1
        RESULT(J)=F(X(J),Y(J), ...)
    END DO
END PARALLEL DO
The constant M requires calibration for efficiency; you have to "tune" it. You also have to take care of the fact that N might not be a multiple of M; that just requires an extra cleanup loop to handle the edge condition:
PARALLEL DO I=1,int(N/M)*M,M
    DO J=I,I+M-1
        RESULT(J)=F(X(J),Y(J), ...)
    END DO
END PARALLEL DO
DO J=int(N/M)*M+1,N
    RESULT(J)=F(X(J),Y(J), ...)
END DO
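In a language with a parallel runtime, the library can take over the role of M. In C#, for instance, Partitioner.Create hands each worker a contiguous range, the chunk size is tuned for you, and the ranges tile the index range exactly, so no cleanup loop is needed (a sketch):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ChunkedParallelDemo
{
    static double F(double x, double y) => Math.Sin(x) * Math.Cos(y);

    static void Main()
    {
        const int N = 1_000_000;
        var x = new double[N]; var y = new double[N]; var result = new double[N];
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2 * i; }

        // Each worker receives a contiguous [from, to) range,
        // amortizing the fork overhead across many iterations.
        Parallel.ForEach(Partitioner.Create(0, N), range =>
        {
            for (int j = range.Item1; j < range.Item2; j++)
                result[j] = F(x[j], y[j]);
        });

        Console.WriteLine(result[42]);
    }
}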

Related

Intelligent purely functional sets

Set computations composed of unions, intersections and differences can often be expressed in many different ways. Are there any theories or concrete implementations that try to minimize the amount of computation required to reach a given answer?
For example, I first came across a practical application of this when trying to decompose atoms in a simulation of an amorphous material into neighbor shells, where the first shell are the immediate neighbors of some given origin atom and the n-th shell are those atoms that are neighbors of the (n-1)-th shell but not already in the (n-1)-th or (n-2)-th shells:
nth 0 = singleton i
nth 1 = neighbors i
nth n = reduce union (map neighbors (nth(n-1))) - nth(n-1) - nth(n-2)
There are many different ways to solve this. You can incrementally test membership in each set whilst composing the result, or you can compute the union of three neighbor shells and use intersection to remove the previous two shells, leaving the outermost one. In practice, solutions that require the construction of large intermediate sets are slower.
Presumably an intelligent set implementation could compose the expression that was to be evaluated and then optimize it (e.g. to reduce the size of intermediate sets) before evaluating it in order to improve performance. Do such set implementations exist?
Your question immediately reminded me of Haskell's stream fusion, described in this paper. The general principle can be summarized quite easily: Instead of storing a list, you store a way to build a list. Then the list transformation functions operate directly on the list generator, meaning that all the operations fuse into a single generation of the data without any intermediate structures. Then when you are done composing operations you run the generator and produce the data.
So I think the answer to your question is that if you wanted some similarly intelligent mechanism that fused computations and eliminated intermediate data structures, you'd need to find a way to transform a set into a "co-structure" (that's what the paper calls it) that generates a set and operate directly on that, then actually generate the set when you are done.
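C#'s deferred IEnumerable is a much weaker cousin of stream fusion, but it makes the idea concrete: the query below is only a description of the next shell, and no intermediate set is materialized until the final HashSet constructor enumerates it. (Neighbors is a hypothetical stand-in.)

using System;
using System.Collections.Generic;
using System.Linq;

class NeighborShellDemo
{
    // Hypothetical neighbor lookup for a 1-D "material".
    static IEnumerable<int> Neighbors(int atom) => new[] { atom - 1, atom + 1 };

    static void Main()
    {
        var shell0 = new HashSet<int> { 42 };       // origin atom
        var shell1 = new HashSet<int> { 41, 43 };   // its immediate neighbors

        // A lazily-composed pipeline: nothing runs and no intermediate
        // set exists until the HashSet constructor forces enumeration.
        IEnumerable<int> query = shell1
            .SelectMany(Neighbors)                                   // map neighbors / union
            .Where(a => !shell1.Contains(a) && !shell0.Contains(a)); // subtract previous shells

        var shell2 = new HashSet<int>(query);
        Console.WriteLine(string.Join(", ", shell2));                // 40, 44
    }
}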
I think there's a very deep theory behind this concept that the paper hints at but never spells out, and if somebody else here knows what it is, please let me know, because this is very relevant to something else I am doing, too!

Methods for crossover in genetic algorithms

When reading about the crossover part of genetic algorithms, books and papers usually refer to methods of simply swapping out bits in the data of two selected candidates which are to reproduce.
I have yet to see actual code of an implemented genetic algorithm for actual industry applications, but I find it hard to imagine that it's enough to operate on simple data types.
I always imagined that the various stages of genetic algorithms would be performed on complex objects involving complex mathematical operations, as opposed to just swapping out some bits in single integers.
Even Wikipedia just lists these kinds of operations for crossover.
Am I missing something important or are these kinds of crossover methods really the only thing used?
There are several things used... although the need for parallelism and several generations (and sometimes a big population) leads to using techniques that perform well...
Another point to keep in mind is that "swapping out some bits", when modeled correctly, resembles a simple and rather accurate version of what happens naturally (recombination of genes, mutations)...
For a very simple and nicely written walkthrough see http://www.electricmonk.nl/log/2011/09/28/evolutionary-algorithm-evolving-hello-world/
For some more info see
http://www.codeproject.com/KB/recipes/btl_ga.aspx
http://www.codeproject.com/KB/recipes/genetics_dot_net.aspx
http://www.codeproject.com/KB/recipes/GeneticandAntAlgorithms.aspx
http://www.c-sharpcorner.com/UploadFile/mgold/GeneticAlgorithm12032005044205AM/GeneticAlgorithm.aspx
I always imagined that the various stages of genetic algorithms would be performed on complex objects involving complex mathematical operations, as opposed to just swapping out some bits in single integers.
You probably think complex mathematical operations are used because you think the Genetic Algorithm has to modify a complex object. That's usually not how a Genetic Algorithm works.
So what does happen? Well, usually, the programmer (or scientist) will identify various parameters in a configuration, and then map those parameters to integers/floats. This does limit the directions in which the algorithm can explore, but that's the only realistic method of getting any results.
Let's look at evolving an antenna. You could perform a complex simulation with a genetic algorithm rearranging copper molecules, but that would be very complex and take forever. Instead, you'd identify antenna "parameters". Most antennas are built up out of certain lengths of wire, bent at certain places in order to maximize their coverage area. So you could identify a couple of parameters: number of starting wires, section lengths, angle of the bends. All those are easily represented as integer numbers, and are therefore easy for the Genetic Algorithm to manipulate. The resulting manipulations can be fed into an "antenna simulator" to see how well it receives signals.
In short, when you say:
I find it hard to imagine that it's enough to operate on simple data types.
you must realize that simple data types can be mapped to much more intricate structures. The Genetic Algorithm doesn't have to know anything about these intricate structures. All it needs to know is how it can manipulate the parameters that build up the intricate structures. That is, after all, the way DNA works.
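A toy sketch of that genotype-to-phenotype mapping (all names and the gene layout here are hypothetical, for illustration): the GA only ever sees and mutates the flat integer array, and a decoder turns it into the structure the simulator scores.

using System;

class GenotypeDemo
{
    // Hypothetical gene layout:
    // [0] = number of wires, [1..2] = section lengths (mm), [3..4] = bend angles (deg)
    static string Decode(int[] genes) =>
        $"antenna: {genes[0]} wires, sections {genes[1]}mm/{genes[2]}mm, bends {genes[3]}/{genes[4]} deg";

    static void Main()
    {
        var genes = new[] { 2, 120, 80, 45, 90 };
        // A hypothetical antenna simulator would score the decoded structure:
        Console.WriteLine(Decode(genes));
    }
}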
In genetic algorithms, bit swapping of some variety is usually used.
As you have said:
I always imagined that the various stages of genetic algorithms would be performed on complex objects involving complex mathematical operations
What I think you are looking for is Genetic Programming, where the chromosome describes a program; in that case you would be able to do more with the operators when applying crossover.
Also make sure you have understood the difference between your fitness function in genetic algorithms, and the operators within a chromosome in genetic programming.
Different applications require different encodings. The goal certainly is to find the most effective encoding, and often enough the simple encodings are the better suited ones. So for example a Job Shop Scheduling Problem might be represented as a list of permutations which represent the execution order of the jobs on the different machines (a so-called job sequence matrix). It can however also be represented as a list of priority rules that construct the schedule. A traveling salesman problem or quadratic assignment problem is typically represented by a single permutation that denotes a tour in one case or an assignment in the other case. Optimizing the parameters of a simulation model or finding the root of a complex mathematical function is typically represented by a vector of real values.
For all of these still simple representations, crossover and mutation operators exist. For permutations these are e.g. OX, ERX, CX, PMX, UBX, OBX, and many more. If you can combine a number of simple representations to represent a solution of your complex problem, you might reuse these operations and apply them to each component individually.
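As an illustration, here is a minimal sketch of OX (order crossover) for permutations; it keeps a slice of one parent in place and fills the rest in the other parent's relative order, so the child is always a valid permutation and needs no repair:

using System;
using System.Linq;

class OxDemo
{
    // Order crossover (OX): copy p1[cut1..cut2) into the child, then
    // fill the remaining slots with p2's genes in p2's order.
    static int[] OrderCrossover(int[] p1, int[] p2, int cut1, int cut2)
    {
        int n = p1.Length;
        var child = new int[n];
        var used = new bool[n + 1]; // assumes genes are small ints in 0..n; adapt as needed
        for (int i = cut1; i < cut2; i++) { child[i] = p1[i]; used[p1[i]] = true; }
        int pos = cut2 % n;
        foreach (int gene in Enumerable.Range(0, n).Select(k => p2[(cut2 + k) % n]))
        {
            if (used[gene]) continue;
            child[pos] = gene; used[gene] = true;
            pos = (pos + 1) % n;
        }
        return child;
    }

    static void Main()
    {
        var c = OrderCrossover(new[] { 1, 2, 3, 4, 5, 6 }, new[] { 6, 5, 4, 3, 2, 1 }, 2, 4);
        Console.WriteLine(string.Join(" ", c)); // 6 5 3 4 2 1
    }
}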
The important thing about crossover to work effectively is that a few properties should be fulfilled:
The crossover should conserve those parts that are similar in both parents
For those parts that are not similar, the crossover should not introduce an element that is not already part of one of the parents
The crossover of two solutions should, if possible, produce a feasible solution
You want to avoid so called unwanted mutations in your crossovers. In that light you also want to avoid having to repair a large part of your chromosomes after crossover, since that is also introducing unwanted mutations.
If you want to experiment with different operators and problems, we have a nice GUI-driven software: HeuristicLab.
Simple bit swapping is usually the way to go. The key thing to note is the encoding that is used in each candidate solution. Solutions should be encoded such that there is minimal or no error introduced into the new offspring. Any error would require the algorithm to provide a fix, which would lead to increased processing time.
As an example, I have developed a university timetable generator in C# that uses an integer encoding to represent the timeslots available on each day. This representation allows a very efficient single-point or multi-point crossover operator, which uses the LINQ Intersect function to combine parents.
Typical multipoint crossover with hill-climbing
public List<TimeTable> CrossOver(List<TimeTable> parents) // multipoint crossover
{
    // rnd and CrossOverProb are fields of the containing class
    var baby1 = new TimeTable {Schedule = new List<string>(), Fitness = 0};
    var baby2 = new TimeTable {Schedule = new List<string>(), Fitness = 0};
    for (var gen = 0; gen < parents[0].Schedule.Count; gen++)
    {
        if (rnd.NextDouble() < (double) CrossOverProb)
        {
            // swap this gene between the two offspring
            baby2.Schedule.Add(parents[0].Schedule[gen]);
            baby1.Schedule.Add(parents[1].Schedule[gen]);
        }
        else
        {
            baby1.Schedule.Add(parents[0].Schedule[gen]);
            baby2.Schedule.Add(parents[1].Schedule[gen]);
        }
    }
    CalculateFitness(ref baby1);
    CalculateFitness(ref baby2);
    // allow hill-climbing: keep the two fittest of parents + offspring
    parents.Add(baby1);
    parents.Add(baby2);
    return parents.OrderByDescending(i => i.Fitness).Take(2).ToList();
}

What can be parameters other than time and space while analyzing certain algorithms?

I was interested to know about parameters other than space and time when analysing the effectiveness of an algorithm. For example, we can focus on the effectiveness of the trapdoor function while developing encryption algorithms. What other things can you think of?
First and foremost there's correctness. Make sure your algorithm always works, no matter what the input. Even for input that the algorithm is not designed to handle, you should print an error message, not crash the entire application. If you use greedy algorithms, make sure they truly work in every case, not just a few cases you tried by hand.
Then there's practical efficiency. An O(N^2) algorithm can be a lot faster than an O(N) algorithm in practice. Do actual tests and don't rely on theoretical results too much.
Then there's ease of implementation. You usually don't need the best intro sort implementation to sort an array of 100 integers once, so don't bother.
Look for worst cases in your algorithms and if possible, try to avoid them. If you have a generally fast algorithm but with a very bad worst case, consider detecting that worst case and solving it using another algorithm that is generally slower but better for that single case.
Consider space and time tradeoffs. If you can afford the memory in order to get better speeds, there's probably no reason not to do it, especially if you really need the speed. If you can't afford the memory but can afford to be slower, do that.
If you can, use existing libraries. Don't roll your own multiprecision library if you can use GMP for example. For C++, stuff like boost and even the STL containers and algorithms have been worked on for years by an army of people and are most likely better than you can do alone.
Stability (sorting) - Does the algorithm maintain the relative order of equal elements?
Numeric Stability - Is the algorithm prone to error when very large or small real numbers are used?
Correctness - Does the algorithm always give the correct answer? If not, what is the margin of error?
Generality - Does the algorithm work in many situations (e.g. with many different data types)?
Compactness - Is the program for the algorithm concise?
Parallelizability - How well does performance scale when the number of concurrent threads of execution is increased?
Cache Awareness - Is the algorithm designed to maximize use of the computer's cache?
Cache Obliviousness - Is the algorithm tuned for particular cache sizes / cache-line sizes, or does it perform well regardless of the parameters of the cache?
Complexity. Two algorithms being the same in all other respects, the one that's much simpler is going to be a much better candidate for future customization and use.
Ease of parallelization. Depending on your use case, it might not make any difference or, on the other hand, make the algorithm useless because it can't use 10000 cores.
Stability - some algorithms may "blow up" with certain test conditions, e.g. take an inordinately long time to execute, or use an inordinately large amount of memory, or perhaps not even terminate.
For algorithms that perform floating point operations, the accumulation of round-off error is often a consideration.
Power consumption, for embedded algorithms (think smartcards).
One important parameter that is frequently measured in the analysis of algorithms is that of cache hits and cache misses. While this is a very implementation- and architecture-dependent issue, it is possible to generalise somewhat. One particularly interesting property of an algorithm is being cache-oblivious, which means that the algorithm will use the cache optimally on multiple machines with different cache sizes and structures without modification.
Time and space are the big ones, and they seem so plain and definitive, whereas they should often be qualified (1). The fact that the OP uses the word "parameter" rather than say "criteria" or "properties" is somewhat indicative of this (as if a big-O value on time and on space were sufficient to frame the underlying algorithm).
Other criteria include:
domain of applicability
complexity
mathematical tractability
definitiveness of outcome
ease of tuning (may be tied to "complexity" and "tractability" aforementioned)
ability of running the algorithm in a parallel fashion
(1) "qualified": As hinted in other answers, a -technically- O(n^2) algorithm may be found to be faster than say an O(n) algorithm, in 90% of the cases (which, btw, may turn out to be 100% of the practical cases)
Worst case and best case are also interesting, especially when linked to some conditions in the input. If your input data shows some properties, an algorithm, by taking advantage of those properties, may perform better than another algorithm which performs the same task but does not use them.
For example, many sorting algorithms perform very efficiently when the input is partially ordered in a specific way which minimizes the number of operations the algorithm has to execute. (If your input is mostly sorted, an insertion sort will fit nicely, while you would never use that algorithm otherwise.)
If we're talking about algorithms in general, then (in the real world) you might have to think about CPU, filesystem (read/write operations), and bandwidth usage.
True, they are way down the list of things you need to worry about these days, but given a massive enough volume of data and cheap enough infrastructure, you might have to tweak your code to ease up on one or the other.
What you are interested in aren't parameters; rather, they are intrinsic properties of an algorithm.
Anyway, another property you might be interested in, and analyse an algorithm for, concerns heuristics (or rather, approximation algorithms), i.e. algorithms which don’t find an exact solution but rather one that is (hopefully) good enough.
You can analyze how far a solution is from the theoretical optimal solution in the worst case. For example, an existing algorithm (forgot which one) approximates the optimal travelling salesman tour by a factor of two, i.e. in the worst case it’s twice as long as the optimal tour.
Another metric concerns randomized algorithms, where randomization is used to prevent unwanted worst-case behaviour. One example is randomized quicksort; quicksort has a worst-case running time of O(n^2), which we want to avoid. By shuffling the array beforehand we can avoid the worst case (i.e. an already sorted array) with very high probability. Just how high this probability is can be important to know; this is another intrinsic property of the algorithm that can be analyzed using stochastic methods.
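A minimal sketch of the shuffle-first idea (with Array.Sort standing in for the quicksort being protected):

using System;

class RandomizedSortDemo
{
    static readonly Random Rng = new Random();

    // Fisher-Yates shuffle: makes an adversarial (e.g. already sorted)
    // input statistically indistinguishable from a random one, so the
    // O(n^2) quicksort worst case is avoided with very high probability.
    static void Shuffle(int[] a)
    {
        for (int i = a.Length - 1; i > 0; i--)
        {
            int j = Rng.Next(i + 1);
            (a[i], a[j]) = (a[j], a[i]);
        }
    }

    static void Main()
    {
        var data = new[] { 1, 2, 3, 4, 5, 6, 7, 8 }; // adversarial (sorted) input
        Shuffle(data);
        Array.Sort(data); // stand-in for the quicksort being protected
        Console.WriteLine(string.Join(" ", data));
    }
}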
For numeric algorithms, there's also the property of continuity: that is, whether if you change input slightly, output also changes only slightly. See also Continuity analysis of programs on Lambda The Ultimate for a discussion and a link to an academical paper.
For lazy languages, there's also strictness: f is called strict if f _|_ = _|_ (where _|_ denotes the bottom (in the sense of domain theory), a computation that can't produce a result due to non-termination, errors etc.), otherwise it is non-strict. For example, the function \x -> 5 is non-strict, because (\x -> 5) _|_ = 5, whereas \x -> x + 1 is strict.
Another property is determinism: whether the result of the algorithm (or its other properties, such as running time or space consumption) depends solely on its input.
All these things in the other answers about the quality of various algorithms are important and should be considered.
But time and space are two things that vary at some rate compared to the size of the input (n). So what else can vary according to n?
There are several that are related to I/O. For example, the number of writes to a disk is an important one, which may not be directly shown by space and time estimates alone. This becomes particularly important with flash memory, where the number of writes to the same memory location is the significant metric in some algorithms.
Another I/O metric would be "chattiness". A networking protocol might send shorter messages more often, adding up to the same space and time as another networking protocol, but some aspect of the system (perhaps billing?) might make minimizing either the size or the number of the messages desirable.
And that brings us to Cost, which is a very important algorithmic consideration sometimes. The cost of an algorithm may be affected by both space and time in different amounts (consider the separate costing of server storage space and gigabits of data transfer), but the cost is the thing that you wish to minimize overall, so it may have its own big-O estimations.

What is an efficient way to go beyond a greedy algorithm

The domain of this question is scheduling operations on constrained hardware. The resolution of the result is the number of clock cycles the schedule fits within. The search space grows very rapidly: early decisions constrain future decisions, and the total number of possible schedules grows exponentially. A lot of the possible schedules are equivalent, because just swapping the order of two instructions usually results in the same timing constraint.
Basically the question is what is a good strategy for exploring the vast search space without spending too much time. I expect to search only a small fraction but would like to explore different parts of the search space while doing so.
The current greedy algorithm tends to make stupid decisions early on sometimes, and the attempt at branch and bound was beyond slow.
Edit:
I want to point out that the result is very binary: the greedy algorithm might end up using 8 cycles while there exists a solution using only 7 cycles, found by branch and bound.
The second point is that there are significant restrictions in data routing between instructions, and dependencies between instructions, that limit the amount of commonality between solutions. Look at it as a knapsack problem with a lot of ordering constraints, as well as some solutions completely failing because of routing congestion.
Clarification:
In each cycle there is a limit to how many operations of each type and some operations have two possible types. There are a set of routing constraints which can be varied to be either fairly tight or pretty forgiving and the limit depends on routing congestion.
Integer linear optimization for NP-hard problems
Depending on your side constraints, you may be able to use the critical path method or (as suggested in a previous answer) dynamic programming. But many scheduling problems are NP-hard, just like the classical traveling salesman problem --- a precise solution has a worst case of exponential search time, just as you describe in your problem.
It's important to know that while NP-hard problems still have a very bad worst-case solution time, there is an approach that very often produces exact answers with very short computations (the average case is acceptable and you often don't see the worst case).
This approach is to convert your problem to a linear optimization problem with integer variables. There are free-software packages (such as lp-solve) that can solve such problems efficiently.
The advantage of this approach is that it may give you exact answers to NP-hard problems in acceptable time. I used this approach in a few projects.
As your problem statement does not include more details about the side constraints, I cannot go into more detail how to apply the method.
Edit/addition: Sample implementation
Here are some details about how to implement this method in your case (of course, I make some assumptions that may not apply to your actual problem --- I only know the details from your question):
Let's assume that you have 50 instructions cmd(i) (i=1..50) to be scheduled in 10 or less cycles cycle(t) (t=1..10). We introduce 500 binary variables v(i,t) (i=1..50; t=1..10) which indicate whether instruction cmd(i) is executed at cycle(t) or not. This basic setup gives the following linear constraints:
v_it integer variables
0<=v_it; v_it<=1; # 1000 constraints: i=1..50; t=1..10
sum(v_it: t=1..10)==1 # 50 constraints: i=1..50
Now, we have to specify your side conditions. Let's assume that operations cmd(1)...cmd(5) are multiplication operations and that you have exactly two multipliers --- in any cycle, you may perform at most two of these operations in parallel:
sum(v_it: i=1..5)<=2 # 10 constraints: t=1..10
For each of your resources, you need to add the corresponding constraints.
Also, let's assume that operation cmd(7) depends on operation cmd(2) and needs to be executed after it. To make the equation a little bit more interesting, let's also require a two-cycle gap between them:
sum(t*v(2,t): t=1..10) + 3 <= sum(t*v(7,t): t=1..10) # one constraint
Note: sum(t*v(2,t): t=1..10) is the cycle t where v(2,t) is equal to one.
Finally, we want to minimize the number of cycles. This is somewhat tricky because you get quite big numbers in the way that I propose: we assign each v(i,t) a price that grows exponentially with time, so pushing operations off into the future is much more expensive than performing them early:
sum(6^t * v(i,t): i=1..50; t=1..10) --> minimum. # one target function
I chose 6 to be bigger than 5 to ensure that adding one cycle to the system makes it more expensive than squeezing everything into fewer cycles. A side effect is that the program will go out of its way to schedule operations as early as possible. You may avoid this by performing a two-step optimization: first, use this target function to find the minimal number of necessary cycles; then ask the same problem again with a different target function --- limiting the number of available cycles at the outset and imposing a more moderate price penalty for later operations. You have to play with this; I hope you get the idea.
Hopefully, you can express all your requirements as such linear constraints in your binary variables. Of course, there may be many opportunities to exploit your insight into your specific problem to do with less constraints or less variables.
Then, hand your problem off to lp-solve or cplex and let them find the best solution!
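As an illustration only (the choice of Google OR-tools' C# bindings is mine, not the answer's; lp-solve and cplex have analogous APIs), the first two constraint families and the exponential-price objective might be set up like this:

using System;
using Google.OrTools.LinearSolver;

class ScheduleIlpDemo
{
    static void Main()
    {
        const int Cmds = 50, Cycles = 10;
        Solver solver = Solver.CreateSolver("SCIP"); // assumes a MIP backend is available
        var v = new Variable[Cmds, Cycles];
        for (int i = 0; i < Cmds; i++)
            for (int t = 0; t < Cycles; t++)
                v[i, t] = solver.MakeBoolVar($"v_{i}_{t}"); // v(i,t) = 1 iff cmd(i) runs in cycle(t)

        // Each instruction is scheduled in exactly one cycle.
        for (int i = 0; i < Cmds; i++)
        {
            Constraint once = solver.MakeConstraint(1, 1, $"once_{i}");
            for (int t = 0; t < Cycles; t++) once.SetCoefficient(v[i, t], 1);
        }

        // At most two of the five multiply ops per cycle.
        for (int t = 0; t < Cycles; t++)
        {
            Constraint mul = solver.MakeConstraint(0, 2, $"mul_{t}");
            for (int i = 0; i < 5; i++) mul.SetCoefficient(v[i, t], 1);
        }

        // Exponential price 6^t pushes work into early cycles.
        Objective obj = solver.Objective();
        for (int i = 0; i < Cmds; i++)
            for (int t = 0; t < Cycles; t++)
                obj.SetCoefficient(v[i, t], Math.Pow(6, t));
        obj.SetMinimization();

        Console.WriteLine(solver.Solve()); // OPTIMAL if a schedule exists
    }
}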
At first blush, it sounds like this problem might fit into a dynamic programming solution. Several operations may take the same amount of time so you might end up with overlapping subproblems.
If you can map your problem to the "travelling salesman" (like: Find the optimal sequence to run all operations in minimum time), then you have an NP-complete problem.
A very quick way to solve that is the ant algorithm (or ant colony optimization).
The idea is that you send an ant down every path. The ant spreads a smelly substance on the path, which evaporates over time. Short paths mean that the path will stink more when the next ant comes along, because the pheromone has had less time to evaporate. Ants prefer smelly over clean paths. Run thousands of ants through the network. The most smelly path is the optimal one (or at least very close).
Try simulated annealing, cfr. http://en.wikipedia.org/wiki/Simulated_annealing .

Performance Testing for Calculation-Heavy Programs

What are some good tips and/or techniques for optimizing and improving the performance of calculation-heavy programs? I'm talking about things like complicated graphics calculations or mathematical and simulation types of programming, where every second saved is useful, as opposed to IO-heavy programs where only a certain amount of speedup is helpful.
While changing the algorithm is frequently mentioned as the most effective method here, I'm trying to find out how effective different algorithms are in the first place, so I want to create as much efficiency with each algorithm as is possible. The "problem" I'm solving isn't something that's well known, so there are few if any algorithms on the web, but I'm looking for any good advice on how to proceed and what to look for.
I am exploring the differences in effectiveness between evolutionary algorithms and more straightforward approaches for a particular group of related problems. I have written three evolutionary algorithms for the problem already, and now I have written a brute force technique that I am trying to make as fast as possible.
Edit: To specify a bit more. I am using C# and my algorithms all revolve around calculating and solving constraint type problems for expressions (using expression trees). By expressions I mean things like x^2 + 4 or anything else like that which would be parsed into an expression tree. My algorithms all create and manipulate these trees to try to find better approximations. But I wanted to put the question out there in a general way in case it would help anyone else.
I am trying to find out if it is possible to write a useful evolutionary algorithm for finding expressions that are a good approximation for various properties. Both because I want to know what a good approximation would be and to see how the evolutionary stuff compares to traditional methods.
It's pretty much the same process as any other optimization: profile, experiment, benchmark, repeat.
First you have to figure out what sections of your code are taking up the time. Then try different methods to speed them up (trying methods based on merit would be a better idea than trying things at random). Benchmark to find out if you actually did speed them up. If you did, replace the old method with the new one. Profile again.
I would recommend against a brute force approach if it's at all possible to do it some other way. But, here are some guidelines that should help you speed your code up either way.
There are many, many different optimizations you could apply to your code, but before you do anything, you should profile to figure out where the bottleneck is. Here are some profilers that should give you a good idea about where the hot spots are in your code:
GProf
PerfMon2
OProfile
HPCToolkit
These all use sampling to get their data, so the overhead of running them with your code should be minimal. Only GProf requires that you recompile your code. Also, the last three let you do both time and hardware performance counter profiles, so once you do a time (or CPU cycle) profile, you can zoom in on the hotter regions and find out why they might be running slow (cache misses, FP instruction counts, etc.).
Beyond that, it's a matter of thinking about how best to restructure your code, and this depends on what the problem is. It may be that you've just got a loop that the compiler doesn't optimize well, and you can inline or move things in/out of the loop to help the compiler out. Or, if you're running as fast as you can with basic arithmetic ops, you may want to try to exploit vector instructions (SSE, etc.). If your code is parallel, you might have load balance problems, and you may need to restructure your code so that data is better distributed across cores.
These are just a few examples. Performance optimization is complex, and it might not help you nearly enough if you're doing a brute force approach to begin with.
For more information on ways people have optimized things, there were some pretty good examples in the recent Why do you program in assembly? question.
If your optimization problem is (quasi-)convex or can be transformed into such a form, there are far more efficient algorithms than evolutionary search.
If you have large matrices, pay attention to your linear algebra routines. The right algorithm can shave an order of magnitude off the computation time, especially if your matrices are sparse.
Think about how data is loaded into memory. Even when you think you're spending most of your time on pure arithmetic, you're actually spending a lot of time moving things between levels of cache etc. Do as much as you can with the data while it's in the fastest memory.
Try to avoid unnecessary memory allocation and de-allocation. Here's where it can make sense to back away from a purely OO approach.
This is more of a tip to find holes in the algorithm itself...
To realize maximum performance, simplify everything inside the innermost loop at the expense of everything else.
One example of keeping things simple is the classic bouncing ball animation. You can implement gravity by looking up the definition in your physics book and plugging in the numbers, or you can do it like this and save precious clock cycles:
initialize:
    float y = 0;   // y coordinate
    float yi = 0;  // incremental variable
loop:
    y += yi;
    yi += 0.001;
    if (y > 10)
        yi = -yi;
But now let's say you're having to do this with nested loops in an N-body simulation where every particle is attracted to every other particle. This can be an enormously processor intensive task when you're dealing with thousands of particles.
You should of course take the same approach and simplify everything inside the innermost loop. But more than that, at the very simplest level you should also use data types wisely. For example, math operations are faster when working with integers than with floating point variables. Also, addition is faster than multiplication, and multiplication is faster than division.
So with all of that in mind, you should be able to simplify the innermost loop using primarily addition and multiplication of integers. And then any scaling down you might need can be done afterwards. To take the y and yi example: if yi is an integer that you modify inside the inner loop, then you could scale it down after the loop like this:
y += yi * 0.01;
These are very basic low-level performance tips, but they're all things I try to keep in mind whenever I'm working with processor intensive algorithms. Of course, if you then take these ideas and apply them to parallel processing on a GPU then you can take your algorithm to a whole new level. =)
Well, how you do this depends most on which language you are using. Still, the key in any language is the profiler. Profile your code. See which functions/operations are taking the most time and then determine if you can make these costly operations more efficient.
Standard bottlenecks in numerical algorithms are memory usage (do you access matrices in the order in which the elements are stored in memory?), communication overhead, etc. They can be a little different from those of other non-numerical programs.
Moreover, many other factors such as preconditioning, etc. can lead to drastically different performance behavior of the SAME algorithm on the same problem. Make sure you determine optimal parameters for your implementations.
As for comparing different algorithms, I recommend reading the paper
"Benchmarking optimization software with performance profiles," Jorge Moré and Elizabeth D. Dolan, Mathematical Programming 91 (2002), 201-213.
It provides a nice, uniform way to compare different algorithms being applied to the same problem set. It really should be better known outside of the optimization community (in my not so humble opinion at least).
Good luck!
