Parallelization of a serial reproduction genetic algorithm

I've implemented a genetic algorithm which uses a reproduction method based on the one described in Regularized Evolution for Image Classifier Architecture Search.
Minimal pseudo-code describing the reproduction method:
num_steps = 1000
step = 0
1. while step < num_steps:
1a.   winner, loser = tournament_selection(population)
1b.   offspring = mutate(winner)
1c.   replace loser with offspring
1d.   step++
The authors in [1] mention that this loop is parallelized by distributing it over multiple workers. They also mention that the full implementation is given in the released source, but the linked code does not contain the relevant parallelization part.
I fail to understand how this specific reproduction method can be parallelized: the parent selection step (1a) depends on the current state of the population.
Some notes:
This method works better than the vanilla reproduction method, which applies selection+mutation to the entire population in its current state and is easily parallelizable.
I have written to the main author regarding this point but did not get a response.

You can imagine a parallelization scheme as follows (with n workers):
num_steps = 1000
num_cycles = num_steps / n
cycle = 0
while cycle < num_cycles:
    children = []
    for i in range(n):
        children.append(start_worker(current_population))
    for i in range(n):
        current_population.remove(oldest)
    current_population.extend(children)
    cycle += 1

start_worker(population):
    winner, loser = tournament_selection(population)
    offspring = mutate(winner)
    offspring.fitness = compute_fitness(offspring)
    return offspring
This way, at each step you create n tournaments and generate n children. I think the parallelization would be efficient since the computational cost often resides in the compute_fitness() method.
Using this method, you would reduce the selection pressure since more children are generated from the current population. This effect could be balanced by increasing the size of the tournament.
It could be interesting to compare the optimization performance on a toy problem with and without this parallelization scheme.
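For concreteness, here is a minimal sketch of this scheme in Python with a process pool. The helpers tournament_selection, mutate and compute_fitness are hypothetical placeholders you would supply, and the oldest-first list ordering is an assumption; this is one possible realization, not the paper's implementation:

from concurrent.futures import ProcessPoolExecutor

def worker(population):
    # One reproduction step, run against a snapshot of the population.
    # tournament_selection, mutate and compute_fitness are hypothetical
    # helpers; they must be importable (picklable) for a process pool.
    winner, loser = tournament_selection(population)
    offspring = mutate(winner)
    offspring.fitness = compute_fitness(offspring)  # the expensive part
    return offspring

def evolve(population, num_steps, n):
    # population is assumed to be a list ordered oldest-first
    num_cycles = num_steps // n
    with ProcessPoolExecutor(max_workers=n) as pool:
        for _ in range(num_cycles):
            # run n tournaments against the same snapshot, in parallel
            snapshot = list(population)
            children = list(pool.map(worker, [snapshot] * n))
            # age out the n oldest individuals and add the n children
            del population[:n]
            population.extend(children)
    return population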


How many subproblems are there in this Activity Selection recursive breakdown?

Activity Selection: Given a set of activities A with start and end times, find a maximum subset of mutually compatible activities.
My problem
The two approaches seem to be the same, but the numSubproblems in firstApproach is exponential, while in secondApproach it is O(n^2). If I were to memoize the results, then how can I memoize firstApproach?
The naive firstApproach
let max = 0
for (a : Activities):
    let B = {Activities - allIncompatibleWith(a)}
    let maxOfSubproblem = ActivitySelection(B)
    max = max(max, maxOfSubproblem + 1)
return max
1. Assume a particular activity `a` is part of the optimal solution.
2. Find the set of activities incompatible with `a`: `allIncompatibleWith(a)`.
3. Solve ActivitySelection for the set of activities `{Activities - allIncompatibleWith(a)}`.
4. Loop over all activities `a in Activities` and choose the maximum.
The CLRS Section 16.1 based secondApproach
Solve for S(0, n+1)

let S(i,j) = 0
for (k : 0 to n):
    let a = Activities(k)
    let S(i,k) = solution for the set of activities that start after activity-i finishes and end before activity-k starts
    let S(k,j) = solution for the set of activities that start after activity-k finishes and end before activity-j starts
    S(i,j) = max(S(i,j), S(i,k) + S(k,j) + 1)
return S(i,j)
1. Assume a particular activity `a` is part of the optimal solution.
2. Solve the subproblems for:
   (1) activities that finish before `a` starts
   (2) activities that start after `a` finishes
Let S(i, j) refer to the activities that lie between activities i and j (start after i finishes and end before j starts).
Then S(i,j) characterises the subproblems that need to be solved above:
S(i,j) = max over k of (S(i,k) + S(k,j) + 1), with the variable k looped over the j-i indices between i and j.
My analysis
firstApproach:
#numSubproblems = #subsets of the set of all activities = 2^n.
secondApproach:
#numSubproblems = #ways to choose two indices from n indices, with repetition = n*n = O(n^2).
The two approaches seem to be the same, but the numSubproblems in firstApproach is exponential, while in secondApproach it is O(n^2). What's the catch? Why are they different, even though the two approaches seem to be the same?
The two approaches seem to be the same
The two solutions are not the same. The difference is in the number of states possible in the search space. Both solutions exhibit overlapping sub-problems and optimal substructure. Without memoization, both solutions browse through the entire search space.
Solution 1
This is a backtracking solution where all subsets that are compatible with an activity are tried, and each time an activity is selected, your candidate solution is incremented by 1 and compared with the currently stored maximum. It uses no insight into the start times and end times of the activities. The major difference is that the state of your recurrence is the entire subset of activities (compatible activities) for which the solution needs to be determined (regardless of their start and finish times). If you were to memoize the solution, you would have to use a bitmask (or std::bitset in C++) to store the solution for a subset of activities. You could also use std::set or other Set data structures.
Solution 2
The number of states for the sub-problems in the second solution is greatly reduced, because the recurrence relation solves only for those activities which finish before the start of the current activity and those activities which start after the current activity finishes. Notice that the number of states in such a solution is determined by the number of possible values of the tuple (start time, end time). Since there are n activities, the number of states is at most n^2. If we memoize this solution, we simply need to store the solution for a given start time and end time, which automatically gives a solution for the subset of activities that fall in this range, regardless of whether they are compatible among themselves.
Memoization doesn't always lead to polynomial asymptotic time complexity. In the first approach, you can apply memoization, but that won't reduce the time complexity to polynomial time.
What is memoization?
In simple words, memoization is nothing but a recursive solution (top-down) that stores the results of computed sub-problems. If the same sub-problem is to be calculated again, you return the originally stored solution instead of recomputing it.
Memoization in your first recursive solution
In your case each sub-problem is finding optimal selection of activities for a subset. So the memoization (in your case) will result in storing the optimal solution for all the subsets.
No doubt memoization will give you performance enhancements by avoiding recomputation of solutions on subsets of activities that have been "seen" before, but it can't (in this case) reduce the time complexity to polynomial time, because you end up storing the sub-solutions for every subset (in the worst case).
Where does memoization give us real benefit?
On the other hand, if you see this, where memoization is applied to the Fibonacci series, the total number of sub-solutions that you have to store is linear in the size of the input, and thus it drops the exponential complexity to linear.
How can you memoize the first solution
For applying memoization in the first approach, you need to maintain the sub-solutions. The data structure that you can use is Map<Set<Activity>, Integer>, which will store the maximum number of compatible activities for a given Set<Activity>. In Java, equals() on a java.util.Set works properly across all the implementations, so you can use it.
Your first approach will be modified like this:
// this structure memoizes the sub-solutions
Map<Set<Activity>, Integer> map;

ActivitySelection(Set<Activity> activities) {
    if (map contains activities)
        return map.getValueFor(activities);
    let max = 0
    for (a : activities):
        let B = {activities - allIncompatibleWith(a)}
        let maxOfSubproblem = ActivitySelection(B)
        max = max(max, maxOfSubproblem + 1)
    map.put(activities, max)
    return max
}
On a lighter note:
The time complexity of the second solution (CLRS 16.1) will be O(n^3) instead of O(n^2): you'll need three loops, over i, j and k. The space complexity of this solution is O(n^2).
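To make that concrete, here is a sketch in Python of the memoized (i, j) formulation; the sentinel activities and helper names are my own choices, not CLRS's code:

from functools import lru_cache

def max_activities(activities):
    # activities: list of (start, finish) pairs.
    # Sentinel activities bracket the real ones so S(0, n-1) covers everything.
    acts = sorted(activities, key=lambda a: a[1])
    acts = [(float('-inf'), float('-inf'))] + acts + [(float('inf'), float('inf'))]
    n = len(acts)

    @lru_cache(maxsize=None)
    def S(i, j):
        # Max number of mutually compatible activities strictly between i and j.
        best = 0
        for k in range(i + 1, j):
            # a_k fits if it starts after a_i finishes and ends before a_j starts
            if acts[k][0] >= acts[i][1] and acts[k][1] <= acts[j][0]:
                best = max(best, S(i, k) + S(k, j) + 1)
        return best

    return S(0, n - 1)

# The CLRS 16.1 example set; the optimal answer is 4, e.g. (1,4), (5,7), (8,11), (12,16)
acts = [(1,4), (3,5), (0,6), (5,7), (3,9), (5,9), (6,10), (8,11), (8,12), (2,14), (12,16)]
print(max_activities(acts))  # 4

There are O(n^2) memoized (i, j) states and O(n) work per state, matching the O(n^3) time and O(n^2) space above.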

Expectation Maximization coin toss examples

I've been self-studying Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins 0, 1 and 2 with probabilities P0, P1 and P2 of landing on Heads when tossed. Toss coin 0; if the result is Heads, toss coin 1 three times, else toss coin 2 three times. The observed data produced by coins 1 and 2 is like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's results. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins A and B with PA and PB being the probabilities of landing on Heads when tossed. Each round, select one coin at random and toss it 10 times, then record the results. The observed data is the toss results provided by these two coins. However, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these example work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that, you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails, you find that the maximum-likelihood estimate of the probability of heads is A/(A+B) - it might be worth working this out in detail, because it is the building block for the M step.
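Working that out takes one derivative; here is a quick derivation of the A/(A+B) estimate (my own rendering, not quoted from the PDFs):

% Log-likelihood of A heads and B tails with head-probability p:
\ell(p) = A \ln p + B \ln(1 - p)
% Setting the derivative to zero:
\ell'(p) = \frac{A}{p} - \frac{B}{1 - p} = 0
\implies A(1 - p) = B p
\implies p = \frac{A}{A + B}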
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out P(Y=H|X=HHH) = 0.6967. This means that you use weight 0.6967 for the case Y=H, which means that you increment the counts for Y=H by 0.6967 and increment the counts for X=H in coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the counts for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is the maximum-likelihood estimate of a coin's probability in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
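Here is a compact Python sketch of one EM pass for the three-coin example. The starting values are arbitrary (so the numbers won't reproduce the PDF's 0.6967), but the weighted-count update is exactly the scheme described above:

observations = ["HHH", "TTT", "HHH", "TTT", "HHH"]
p0, p1, p2 = 0.3, 0.3, 0.6  # arbitrary starting guesses

def em_step(p0, p1, p2):
    c0_h = c0_t = 0.0  # weighted head/tail counts for coin 0
    c1_h = c1_t = 0.0  # ... for coin 1
    c2_h = c2_t = 0.0  # ... for coin 2
    for obs in observations:
        h, t = obs.count("H"), obs.count("T")
        # E step: posterior P(coin 0 came up heads | this observation)
        like1 = p0 * p1**h * (1 - p1)**t        # coin 0 heads, so coin 1 was used
        like2 = (1 - p0) * p2**h * (1 - p2)**t  # coin 0 tails, so coin 2 was used
        w = like1 / (like1 + like2)
        # accumulate counts, weighted by the posterior
        c0_h += w
        c0_t += 1 - w
        c1_h += w * h
        c1_t += w * t
        c2_h += (1 - w) * h
        c2_t += (1 - w) * t
    # M step: the A/(A+B) maximum-likelihood estimate on the weighted counts
    return (c0_h / (c0_h + c0_t),
            c1_h / (c1_h + c1_t),
            c2_h / (c2_h + c2_t))

for _ in range(20):
    p0, p1, p2 = em_step(p0, p1, p2)
print(p0, p1, p2)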
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihoods of the processes. (Typically we do at least know the number of the processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data point is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameterized processes, and a set of probabilities is generated: the probability that the data point was generated by the first process, the second process, etc., up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
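As an illustration, here is a toy classify-maximize loop in Python for a 1-D mixture, under the simplifying assumption of Gaussian components with equal, known variances (so "most likely process" reduces to "nearest mean"); it is a sketch of the idea, not anyone's reference implementation:

import random

def classify_maximize(data, k, iters=20):
    # Start with some arbitrary guesses at the process parameters (the means).
    means = random.sample(data, k)
    for _ in range(iters):
        # Classify: assign each point to its most likely process, which for
        # equal-variance Gaussians is simply the nearest mean.
        classes = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - means[j]) ** 2)
            classes[nearest].append(x)
        # Maximize: the ML estimate of each mean is its class average.
        means = [sum(c) / len(c) if c else means[j]
                 for j, c in enumerate(classes)]
    return means

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 5.1]
print(sorted(classify_maximize(data, k=2)))  # roughly [1.0, 5.1]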
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques: using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form, putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the code below in Python; it works through the example given in the second paper you cite, by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer: The code works, but I am certain that it is not coded optimally. I am not normally a Python coder and only started using it two weeks ago.
import numpy as np
import math

#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou ####

def get_mn_log_likelihood(obs, probs):
    """Return the (log) likelihood of obs, given the probs."""
    # Multinomial distribution log PMF:
    # ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
    multinomial_coeff_denom = 0
    prod_probs = 0
    for x in range(0, len(obs)):  # loop through state counts in each observation
        multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
        prod_probs = prod_probs + obs[x] * math.log(probs[x])

    multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
    likelihood = multinomial_coeff + prod_probs
    return likelihood

# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads) = 0.45

# represent the experiments
head_counts = np.array([5, 9, 8, 4, 7])
tail_counts = 10 - head_counts
experiments = list(zip(head_counts, tail_counts))  # list() so it can be indexed in Python 3

# initialise the pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50

# E-M begins!
delta = 0.001
j = 0  # iteration counter
improvement = float('inf')
while improvement > delta:
    expectation_A = np.zeros((5, 2), dtype=float)
    expectation_B = np.zeros((5, 2), dtype=float)
    for i in range(0, len(experiments)):
        e = experiments[i]  # i'th experiment
        # loglikelihood of e given coin A, and given coin B
        ll_A = get_mn_log_likelihood(e, np.array([pA_heads[j], 1 - pA_heads[j]]))
        ll_B = get_mn_log_likelihood(e, np.array([pB_heads[j], 1 - pB_heads[j]]))
        # corresponding weights, proportional to the likelihood of each coin
        weightA = math.exp(ll_A) / (math.exp(ll_A) + math.exp(ll_B))
        weightB = math.exp(ll_B) / (math.exp(ll_A) + math.exp(ll_B))
        expectation_A[i] = np.dot(weightA, e)
        expectation_B[i] = np.dot(weightB, e)

    pA_heads[j + 1] = sum(expectation_A)[0] / sum(sum(expectation_A))
    pB_heads[j + 1] = sum(expectation_B)[0] / sum(sum(expectation_B))

    improvement = max(abs(np.array([pA_heads[j + 1], pB_heads[j + 1]]) -
                          np.array([pA_heads[j], pB_heads[j]])))
    j = j + 1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly; the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables indicating whether coin 1 or coin 2 was used. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of the parameters would be trivial: p_1 would be the proportion of heads in samples where c_i1=1, likewise p_2 for c_i2, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.
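For reference, here is one way to write those updates in the notation above, where h_i and t_i are the head and tail counts of sequence i and \gamma_i is the expected value of c_i1; this is my own rendering, so treat it as a sketch:

% E step: expected value of c_{i1}, i.e. the posterior that coin 1 produced sequence i
\gamma_i = \frac{\lambda\, p_1^{h_i} (1-p_1)^{t_i}}
                {\lambda\, p_1^{h_i} (1-p_1)^{t_i} + (1-\lambda)\, p_2^{h_i} (1-p_2)^{t_i}}

% M step: maximum-likelihood estimates given the expected assignments
\lambda \leftarrow \frac{1}{N} \sum_i \gamma_i, \qquad
p_1 \leftarrow \frac{\sum_i \gamma_i h_i}{\sum_i \gamma_i (h_i + t_i)}, \qquad
p_2 \leftarrow \frac{\sum_i (1-\gamma_i)\, h_i}{\sum_i (1-\gamma_i)(h_i + t_i)}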

Improved Iterative Scaling Method when the number of features is large

In a log-linear model, we can find the maximum entropy solution using IIS.
We update the parameters by finding parameters which make the model expectation of each feature match the empirical expectation.
However, there is an exp(sum of all features) term in the equation.
My question is: when the number of features is large (say 10000), the summation over all the features will blow up easily. How can we solve this problem numerically? To me it seems impossible, since even computing exp(50) will blow up.
Do computations in log-space, and use a logsumexp operation (borrowed from scikit-learn):
# Pseudocode for 1-d version of logsumexp:
# computes log(sum(exp(x) for x in a)) in a numerically stable way.
def logsumexp(a : array of float):
    amax = maximum(a)
    sum = 0.
    for x in a:
        sum += exp(x - amax)
    return log(sum) + amax
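As a quick sanity check, here is a runnable Python version of the same trick, on inputs that would overflow the naive computation:

import math

def logsumexp(a):
    # log(sum(exp(x) for x in a)), computed stably by factoring out max(a)
    amax = max(a)
    return math.log(sum(math.exp(x - amax) for x in a)) + amax

# math.log(sum(math.exp(x) for x in [1000.0, 1000.0]))  # OverflowError
print(logsumexp([1000.0, 1000.0]))  # 1000.693..., i.e. 1000 + log(2)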
This summation can be done once, before the start of the main loop, because the feature values don't change while you're optimizing.
Side remark: IIS is quite old-fashioned. For some ten years now, almost everyone has been using L-BFGS-B, OWL-QN or (A)SGD to fit log-linear models.

Is there a Sorting Algorithm that sorts in O(∞) permutations?

After reading this question and the various Phone Book sorting scenarios put forth in the answer, I found the concept of the BOGO sort to be quite interesting. Certainly there is no use for this type of sorting algorithm, but it did raise an interesting question in my mind: could there be a sorting algorithm that can never complete?
In other words, is there a process where one could attempt to compare and re-order a fixed set of data, yet never achieve an actual sorted list?
This is much more of a theoretical/philosophical question than a practical one, and if I were more of a mathematician I'd probably be able to prove/disprove such a possibility. Has anyone asked this question before, and if so, what can be said about it?
[edit:] No deterministic process with a finite amount of state takes "O(infinity)", since the slowest it can be is to progress through all possible states. This includes sorting.
[earlier, more specific answer:]
No. For a list of size n you only have a state space of size n! in which to store progress (assuming that the entire state of the sort is stored in the ordering of the elements and that it really is "doing something", deterministically).
So the worst possible behaviour would cycle through all available states before terminating, and take time proportional to n!. (At the risk of confusing matters: there must be a single path through the states. Since the ordering is "all the state", you cannot have a process move from state X to Y, and then later from state X to Z, since that would require additional state, or be non-deterministic.)
Idea 1:
function sort( int[] arr ) {
    int[] sorted = quicksort( arr ); // compare and reorder data
    while(true); // where'd this come from???
    return sorted; // return answer
}
Idea 2
How do you define O(infinity)? The formal definition of Big-O merely states that f(x)=O(g(x)) implies that M*g(x) is an upper bound of f(x) given sufficiently large x and some constant M.
Typically when you're talking about "infinity", you are talking about some sort of unbounded limit. So in this case, the only reasonable definition is saying that O(infinity) is O(a function that's larger than every function). Obviously a function that's larger than every function is an upper bound. Thus technically everything is "O(infinity)".
Idea 3
Assuming you mean theta notation (tight bound)...
If you impose the additional restriction that the algorithm is smart (returns when it finds a sorted permutation) and that every permutation of the list must be visited in a finite amount of time, then the answer is no. There are only N! permutations of a list, so the upper bound for such a sorting algorithm is a finite sum of finite times, which is finite.
Your question doesn't really have much to do with sorting. An algorithm which is guaranteed never to complete would be pretty dull. Indeed, even an algorithm which might or might not ever complete would be pretty dull. Much more interesting would be an algorithm which is guaranteed to complete, eventually, but whose worst-case computation time with respect to the size of the input is not expressible as O(F(N)) for any function F that could itself be computed in bounded time. My hunch would be that such an algorithm could be devised, but I'm not sure how.
How about this one:
Start at the first item.
Flip a coin.
If it's heads, switch it with the next item.
If it's tails, don't switch them.
If list is sorted, stop.
If not, move on to the next pair ...
It's a sorting algorithm -- the kind a monkey might do. Is there any guarantee that you'll arrive at a sorted list? I don't think so!
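A minimal Python sketch of that procedure (my own rendering of the steps above, checking for sortedness after each flip):

import random

def coin_flip_sort(a):
    # Walk the list, flipping a coin at each adjacent pair.
    # There is no bound on how long this can run.
    while a != sorted(a):
        for i in range(len(a) - 1):
            if random.random() < 0.5:  # heads: swap with the next item
                a[i], a[i + 1] = a[i + 1], a[i]
            if a == sorted(a):  # sorted? then stop
                return a
    return a

print(coin_flip_sort([3, 1, 2]))  # eventually prints [1, 2, 3]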
Yes -
SortNumbers(collectionOfNumbers)
{
    If IsSorted(collectionOfNumbers) {
        reverse(collectionOfNumbers(1:end/2))
    }
    return SortNumbers(collectionOfNumbers)
}
Input: A[1..n] : n unique integers in arbitrary order
Output: A'[1..n] : reordering of the elements of A
such that A'[i] R(A') A'[j] if i < j.
Comparator: a R(A') b iff A'[i] = a, A'[j] = b and i > j
More generally, make the comparator something that's either (a) impossible to reconcile with the output specification, so that no solution can exist, or (b) uncomputable (e.g., sort these (input, turing machine) pairs in order of the number of steps needed for the machine to halt on the input).
Even more generally, if you have a procedure that fails to halt on a valid input, the procedure is not an algorithm which solves the problem on that input/output domain... which means you don't have an algorithm at all, or that what you have is only an algorithm if you appropriately restrict the domain.
Let's suppose that you have a random coin flipper, infinite arithmetic, and infinite rationals. Then the answer is yes. You can write a sorting algorithm which has 100% chance of successfully sorting your data (so it really is a sorting function), but which on average will take infinite time to do so.
Here is an emulation of this in Python.
# We'll pretend that these are true random numbers.
import random
import fractions

def flip():
    return 0.5 < random.random()

# This tests whether a number is less than an infinite precision number in the
# range [0, 1]. It has a 100% probability of returning an answer.
def number_less_than_rand(x):
    high = fractions.Fraction(1, 1)
    low = fractions.Fraction(0, 1)
    while low < x and x < high:
        if flip():
            low = (low + high) / 2
        else:
            high = (low + high) / 2
    return high < x

def slow_sort(some_array):
    n = fractions.Fraction(100, 1)
    # This loop has a 100% chance of finishing, but its average time to complete
    # is also infinite. If you haven't studied infinite series and products, you'll
    # just have to take this on faith. Otherwise proving that is a fun exercise.
    while not number_less_than_rand(1 / n):
        n += 1
        print(n)
    some_array.sort()

What is dynamic programming? [closed]

What is dynamic programming?
How is it different from recursion, memoization, etc?
I have read the wikipedia article on it, but I still don't really understand it.
Dynamic programming is when you use past knowledge to make solving a future problem easier.
A good example is solving the Fibonacci sequence for n=1,000,002.
This will be a very long process, but what if I give you the results for n=1,000,000 and n=1,000,001? Suddenly the problem just became more manageable.
Dynamic programming is used a lot in string problems, such as the string edit problem. You solve a subset of the problem and then use that information to solve the more difficult original problem.
With dynamic programming, you store your results in some sort of table generally. When you need the answer to a problem, you reference the table and see if you already know what it is. If not, you use the data in your table to give yourself a stepping stone towards the answer.
The Cormen Algorithms book has a great chapter about dynamic programming. AND it's free on Google Books! Check it out here.
Dynamic programming is a technique used to avoid computing the same subproblem multiple times in a recursive algorithm.
Let's take the simple example of the Fibonacci numbers: finding the n-th Fibonacci number defined by
F(n) = F(n-1) + F(n-2) and F(0) = 0, F(1) = 1
Recursion
The obvious way to do this is recursive:
def fibonacci(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n - 1) + fibonacci(n - 2)
Dynamic Programming
Top Down - Memoization
The recursion does a lot of unnecessary calculations because a given Fibonacci number will be calculated multiple times. An easy way to improve this is to cache the results:
cache = {}

def fibonacci(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n in cache:
        return cache[n]
    cache[n] = fibonacci(n - 1) + fibonacci(n - 2)
    return cache[n]
Bottom-Up
A better way to do this is to get rid of the recursion all-together by evaluating the results in the right order:
cache = {}

def fibonacci(n):
    cache[0] = 0
    cache[1] = 1
    for i in range(2, n + 1):
        cache[i] = cache[i - 1] + cache[i - 2]
    return cache[n]
We can even use constant space and store only the necessary partial results along the way:
def fibonacci(n):
    if n == 0:
        return 0
    fi_minus_2 = 0
    fi_minus_1 = 1
    fi = 1  # covers n == 1
    for i in range(2, n + 1):
        fi = fi_minus_1 + fi_minus_2
        fi_minus_1, fi_minus_2 = fi, fi_minus_1
    return fi
How to apply dynamic programming?
Find the recursion in the problem.
Top-down: store the answer for each subproblem in a table to avoid having to recompute them.
Bottom-up: Find the right order to evaluate the results so that partial results are available when needed.
Dynamic programming generally works for problems that have an inherent left to right order such as strings, trees or integer sequences. If the naive recursive algorithm does not compute the same subproblem multiple times, dynamic programming won't help.
I made a collection of problems to help understand the logic: https://github.com/tristanguigue/dynamic-programing
Memoization is when you store previous results of a function call (a real function always returns the same thing, given the same inputs). It doesn't make a difference for algorithmic complexity before the results are stored.
Recursion is the method of a function calling itself, usually with a smaller dataset. Since most recursive functions can be converted to similar iterative functions, this doesn't make a difference for algorithmic complexity either.
Dynamic programming is the process of solving easier-to-solve sub-problems and building up the answer from that. Most DP algorithms will be in the running times between a Greedy algorithm (if one exists) and an exponential (enumerate all possibilities and find the best one) algorithm.
DP algorithms could be implemented with recursion, but they don't have to be.
DP algorithms can't be sped up by memoization, since each sub-problem is only ever solved (or the "solve" function called) once.
It's an optimization of your algorithm that cuts running time.
While a Greedy Algorithm is usually called naive, because it may run multiple times over the same set of data, Dynamic Programming avoids this pitfall through a deeper understanding of the partial results that must be stored to help build the final solution.
A simple example is traversing a tree or a graph only through the nodes that would contribute with the solution, or putting into a table the solutions that you've found so far so you can avoid traversing the same nodes over and over.
Here's an example of a problem that's suited for dynamic programming, from UVA's online judge: Edit Steps Ladder.
I'm going to give a quick briefing of the important part of this problem's analysis, taken from the book Programming Challenges; I suggest you check it out.
Take a good look at that problem: if we define a cost function telling us how far apart two strings are, we have to consider the three natural types of changes:
Substitution - change a single character from pattern "s" to a different character in text "t", such as changing "shot" to "spot".
Insertion - insert a single character into pattern "s" to help it match text "t", such as changing "ago" to "agog".
Deletion - delete a single character from pattern "s" to help it match text "t", such as changing "hour" to "our".
When we set each of these operations to cost one step, we define the edit distance between two strings. So how do we compute it?
We can define a recursive algorithm using the observation that the last character in the string must be either matched, substituted, inserted or deleted. Chopping off the characters involved in the last edit operation leaves a pair of smaller strings. Let i and j be the last character of the relevant prefix of s and t, respectively. There are three pairs of shorter strings after the last operation, corresponding to the strings after a match/substitution, insertion or deletion. If we knew the cost of editing the three pairs of smaller strings, we could decide which option leads to the best solution and choose that option accordingly. We can learn this cost through the awesome thing that's recursion:
#define MATCH  0  /* enumerated type symbol for match */
#define INSERT 1  /* enumerated type symbol for insert */
#define DELETE 2  /* enumerated type symbol for delete */

int string_compare(char *s, char *t, int i, int j)
{
    int k;           /* counter */
    int opt[3];      /* cost of the three options */
    int lowest_cost; /* lowest cost */

    if (i == 0) return(j * indel(' '));
    if (j == 0) return(i * indel(' '));

    opt[MATCH]  = string_compare(s, t, i-1, j-1) + match(s[i], t[j]);
    opt[INSERT] = string_compare(s, t, i, j-1) + indel(t[j]);
    opt[DELETE] = string_compare(s, t, i-1, j) + indel(s[i]);

    lowest_cost = opt[MATCH];
    for (k = INSERT; k <= DELETE; k++)
        if (opt[k] < lowest_cost) lowest_cost = opt[k];

    return( lowest_cost );
}
This algorithm is correct, but is also impossibly slow.
Running on our computer, it takes several seconds to compare two 11-character strings, and the computation disappears into never-never land on anything longer.
Why is the algorithm so slow? It takes exponential time because it recomputes values again and again and again. At every position in the string, the recursion branches three ways, meaning it grows at a rate of at least 3^n – indeed, even faster since most of the calls reduce only one of the two indices, not both of them.
So how can we make the algorithm practical? The important observation is that most of these recursive calls are computing things that have already been computed before. How do we know? Well, there can only be |s| · |t| possible unique recursive calls, since there are only that many distinct (i, j) pairs to serve as the parameters of recursive calls.
By storing the values for each of these (i, j) pairs in a table, we can avoid recomputing them and just look them up as needed.
The table is a two-dimensional matrix m where each of the |s|·|t| cells contains the cost of the optimal solution of this subproblem, as well as a parent pointer explaining how we got to this location:
typedef struct {
    int cost;   /* cost of reaching this cell */
    int parent; /* parent cell */
} cell;

cell m[MAXLEN+1][MAXLEN+1]; /* dynamic programming table */
The dynamic programming version has three differences from the recursive version.
First, it gets its intermediate values using table lookup instead of recursive calls.
**Second,** it updates the parent field of each cell, which will enable us to reconstruct the edit sequence later.
**Third,** it is instrumented using a more general goal_cell() function instead of just returning m[|s|][|t|].cost. This will enable us to apply this routine to a wider class of problems.
Here, a very particular analysis of what it takes to gather the most optimal partial results is what makes the solution a "dynamic" one.
Here's an alternate, full solution to the same problem. It's also a "dynamic" one, even though its execution is different. I suggest you check out how efficient the solution is by submitting it to UVA's online judge. I find it amazing how such a heavy problem was tackled so efficiently.
The key bits of dynamic programming are "overlapping sub-problems" and "optimal substructure". These properties of a problem mean that an optimal solution is composed of the optimal solutions to its sub-problems. For instance, shortest path problems exhibit optimal substructure. The shortest path from A to C is the shortest path from A to some node B followed by the shortest path from that node B to C.
In greater detail, to solve a shortest-path problem you will:
find the distances from the starting node to every node touching it (say from A to B and C)
find the distances from those nodes to the nodes touching them (from B to D and E, and from C to E and F)
we now know the shortest path from A to E: it is the shortest sum of A-x and x-E for some node x that we have visited (either B or C)
repeat this process until we reach the final destination node
Because we are working bottom-up, we already have solutions to the sub-problems when it comes time to use them, by memoizing them.
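As a small illustration of that bottom-up process, here is a Bellman-Ford-style relaxation in Python on a made-up graph; a sketch of the idea, not any particular textbook's code:

def shortest_paths(edges, source, num_nodes):
    # dist[v] holds the best-known distance from source to v; each pass
    # extends already-optimal sub-paths by one more edge, which is the
    # optimal-substructure property at work.
    dist = {v: float('inf') for v in range(num_nodes)}
    dist[source] = 0
    for _ in range(num_nodes - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w  # found a shorter path to v through u
    return dist

# Nodes A..F numbered 0..5, roughly matching the prose above.
edges = [(0, 1, 2), (0, 2, 4), (1, 3, 7), (1, 4, 3), (2, 4, 1), (2, 5, 5)]
print(shortest_paths(edges, source=0, num_nodes=6))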
Remember, dynamic programming problems must have both overlapping sub-problems, and optimal substructure. Generating the Fibonacci sequence is not a dynamic programming problem; it utilizes memoization because it has overlapping sub-problems, but it does not have optimal substructure (because there is no optimization problem involved).
Dynamic Programming
Definition
Dynamic programming (DP) is a general algorithm design technique for solving problems with overlapping sub-problems. This technique was invented by the American mathematician Richard Bellman in the 1950s.
Key Idea
The key idea is to save answers of overlapping smaller sub-problems to avoid recomputation.
Dynamic Programming Properties
An instance is solved using the solutions for smaller instances.
The solutions for a smaller instance might be needed multiple times, so store their results in a table.
Thus each smaller instance is solved only once.
Additional space is used to save time.
I am also very new to Dynamic Programming (a powerful technique for a particular type of problem).
In the most simple words, just think of dynamic programming as a recursive approach that uses previous knowledge.
Previous knowledge is what matters most here: keep track of the solutions of the sub-problems you already have.
Consider this most basic example for DP, from Wikipedia:
Finding the fibonacci sequence
function fib(n)  // naive implementation
    if n <= 1 return n
    return fib(n - 1) + fib(n - 2)
Let's break down the function call with, say, n = 5
fib(5)
fib(4) + fib(3)
(fib(3) + fib(2)) + (fib(2) + fib(1))
((fib(2) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))
(((fib(1) + fib(0)) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))
In particular, fib(2) was calculated three times from scratch. In larger examples, many more values of fib, or sub-problems, are recalculated, leading to an exponential time algorithm.
Now, let's try it by storing the values we have already found in a data structure, say a Map

var m := map(0 → 0, 1 → 1)
function fib(n)
    if key n is not in map m
        m[n] := fib(n - 1) + fib(n - 2)
    return m[n]
Here we save the solutions of sub-problems in the map if we don't have them already. This technique of saving values which we have already calculated is termed memoization.
Finally, for a problem, first try to find the states (the possible sub-problems), and think of the better recursion approach so that you can use the solutions of previous sub-problems in further ones.
Dynamic programming is a technique for solving problems with overlapping sub-problems. A dynamic programming algorithm solves every sub-problem just once and then saves its answer in a table (array), avoiding the work of re-computing the answer every time the sub-problem is encountered.
The underlying idea of dynamic programming is:
Avoid calculating the same stuff twice, usually by keeping a table of known results of sub problems.
The seven steps in the development of a dynamic programming algorithm are as follows:
Establish a recursive property that gives the solution to an instance of the problem.
Develop a recursive algorithm as per recursive property
See if the same instance of the problem is being solved again and again in recursive calls
Develop a memoized recursive algorithm
See the pattern in storing the data in the memory
Convert the memoized recursive algorithm into iterative algorithm
Optimize the iterative algorithm by using the storage as required (storage optimization)
In short, the difference between recursion, memoization and dynamic programming:
Dynamic programming, as the name suggests, is using a previously calculated value to dynamically construct the next new solution.
Where to apply dynamic programming: if your solution is based on optimal substructure and overlapping sub-problems, then using the earlier calculated values will be useful, so you do not have to recompute them. It is a bottom-up approach. Suppose you need to calculate fib(n); in that case, all you need to do is add the previously calculated values of fib(n-1) and fib(n-2).
Recursion: basically subdividing your problem into smaller parts to solve it with ease, but keep in mind that it does not avoid re-computation if the same value has been calculated previously in another recursion call.
Memoization: storing the old calculated recursion values in a table is known as memoization, which avoids re-computation if a value has already been calculated by some previous call, so any value is calculated only once. Before calculating, we check whether this value has already been calculated; if so, we return it from the table instead of recomputing. It is a top-down approach.
Here is a simple Python code example of the recursive, top-down, and bottom-up approaches for the Fibonacci series:
Recursive: O(2^n)
def fib_recursive(n):
    if n == 1 or n == 2:
        return 1
    else:
        return fib_recursive(n-1) + fib_recursive(n-2)

print(fib_recursive(40))
Top-down: O(n). Efficient for larger inputs.
def fib_memoize_or_top_down(n, mem):
    if mem[n] != 0:
        return mem[n]
    else:
        mem[n] = fib_memoize_or_top_down(n-1, mem) + fib_memoize_or_top_down(n-2, mem)
        return mem[n]

n = 40
mem = [0] * (n+1)
mem[1] = 1
mem[2] = 1
print(fib_memoize_or_top_down(n, mem))
Bottom-up: O(n). For simplicity and small input sizes.
def fib_bottom_up(n):
    if n == 1 or n == 2:
        return 1
    mem = [0] * (n+1)
    mem[1] = 1
    mem[2] = 1
    for i in range(3, n+1):
        mem[i] = mem[i-1] + mem[i-2]
    return mem[n]

print(fib_bottom_up(40))
