Calculate the sum of a series - algorithm

I need to calculate the sum of an infinite series using mixed and recursive methods. What is the difference between these two methods?
My code, below, shows how I am doing it. Which method am I using?
For example, to compute the series
Sum = -X -(X^2/2) -(X^3/3) -(X^4/4)....etc
I would use this code
sum := -x;
numerator:= x;
n := 2;
current := -x;
repeat
numerator := numerator * x;
previous := current;
current := numerator/n;
n := n + 1;
sum := sum - current;
until ( abs(previous-current) < eps )

Your problem/question is too vague / too general. I cannot offer more, therefore, than a few general remarks:
For starters, a general method to sum "any" infinite series does not exist. For each series individually you will have to determine how to sum THAT PARTICULAR ONE and this requires, first of all, a study of its convergence-characteristics: a series may converge, diverge, or conditionally converge. Simply adding terms until a term gets smaller than some limit, or until the difference between successive terms becomes smaller than some limit is not a guarantee that you're close to the limiting sum. In fact, it doesn't even guarantee that the sum is finite at all (consider the series 1 + 1/2 + 1/3 + 1/4 ... for example).
Now, let's view your example: -sum( x^n/n; n=1..inf ).
This particular series doesn't have a finite sum for any x>=1 and neither for x<-1: it doesn't converge unless -1<=x<1, the terms get larger and larger... (however, read on!).
For abs(x)<1 a 'straightforward' approach of adding successive terms will 'in the end' give you the correct answer, but it will take a long while before you get close to the limiting sum, unless x is very small, and assessing HOW close you are with any finite sub-sum is far from trivial. Moreover, there are better (=faster-converging) methods to sum such types of series.
In this specific case, you may note that it is log(1-x), expressed in a Maclaurin series expansion, so there's not need to set up a tedious summation at all, because the result of the infinite summation is already known.
Now, consider at the one hand that we can easily see that the terms will get bigger and bigger for higher 'n' whenever abs(x) is greater than 1, so that any simple summing-procedure is bound to fail.
At the other hand we have this Maclaurin expansion for {log(1-x); -1<=x<1} and we may ponder how it all fits in with the knowledge that surely log(1-x) also exists and is finite for x=-4: could we maybe 'define' the limit of the summation also for x<-1 by this logarithm?! Enter the wonderful world of analytic continuation. I won't go into this, it would take far too much space here.
All in all, summing an infinite series is an art, not something to throw into a standard-summing-machine. Consequently, without specifying which series you wish to sum, you cannot a priori say what method one should apply.
Finally, I do not know what you mean by a "mixed method", so I cannot comment on that, or on its comparison against a recursive method. A recursive method might come in whenever you can write your series in a form that is very similar to the original, but just 'slightly simpler'. An example, not from an infinite series, but from a finite series: the Fibonacci number F(n) can be defined as the finite sum F(N-1)+F(n-2). That is recursion and you 'only' have to know some base-value(s) - i.c: F(0)=F(1)=1 - and there you have your recurrence set-up. Rewriting a series in a recursive form may help to find an analytic solution, or to split off a part that has an analytic solution leaving a 'more convenient' series that lends itself to a fast-converging numerical approach.
Maybe "mixed method" is intended to indicate a mixture of an analytical summation - as with your series: log(1-x) - and some (smart or brute-force) numerical approximation (where, as others pointed out, 'recursive' might be meant to be 'iterative').
To conclude: (a) clarify what you mean by "mixed" and "recursive" methods; (b) be specific about what type of series you need to sum, lest there's no sensible answer possible.

Some parts of an answer to your question:
I don't know what you mean by a 'mixed' method, though if you had 2 methods and made a new one out of bits of both of them then I guess I could see that you would then have a mixed method. But as a generally-used, off-the-shelf term, it's meaningless to me. Since you contrast it with 'recursive' and since I've already decided that you are not a native English speaker I wonder if you meant to write 'iterative' ? It is very common to find 'iterative' and 'recursive' methods compared and contrasted.
What you have shown us is an 'iterative' method; a simple view is that an iterative method is one which relies on a loop (or loops) as your code does.
Incidentally, I think you could make your code simpler if you recognised that the first term in your series has the same form as all the other terms, you have simplified (X^1)/1 to X which is mathematically correct, but computationally it is often more straightforward to operate on a sequence of identical terms rather than a sequence where the first term differs in form from all the rest.
A recursive method is one which calls itself. Since I suspect that I am helping you with homework I'm not going to write a recursive method for you, but you should be looking for a function which has the approximate form:
sum([]) = 0
sum([a b c d ...]) = a + sum([b c d ...])
Note that the 'function' sum (which is defined in 2 'clauses') appears on both the left-hand side and righ-hand side of the 2nd clause. Note also that on the right it is applied to a sub-set of the input arguments (on the left) which offers the possibility that at some stage the function will terminate.

The seris that you have presented
-X -(X^2/2) -(X^3/3) -(X^4/4)...
Can be writen as
-(X^1/1) - (X^2/2) - (X^3/3) ... -(X^n/n)
This provide us to
Repeate -(X^i/i) until n where n is abs(previous-current) < eps increasing i each time
This give us:
I hope that bellow code meet your expectation.
i := 1;
sum := 0;
current := x;
repeat
previous := current;
current := - exp(x, i) / i; {Here you call a function exp that realise x^i}
sum := sum + current;
i := i + 1;
until ( abs(previous-current) < eps )

Related

Determining the number of steps in an algorithm

I was going through my Data Structures and Algorithms notes, and came across the following examples regarding Time Complexity and Big-O Notation: The columns on the left count the number of operations carried out in each line. I didn't understand why almost all the lines in the first example have a multiple of 2 in front of them, whereas the other two examples don't. Obviously this doesn't affect the resulting O(n), but I would still like to know where the 2 came form.
I can only find one explanation for this: the sloppiness of the author of the slides.
In a proper analysis one have to explain what kind of operations are performed at which time for what input (like for example this book on page 21). Without this you can not even be sure whether we count multiplication of 2 numbers as 1 operation or 2 or something else?
These slides are inconsistent. For example:
In slide1 currentMax = A[0] takes 2 operations. Kind of makes sense if you take finding 0-th element in array as 1 operation and assigning as another one. But in the slide3 n iterations of s = s + X[i] takes n operations. Which means that s = s + X[i] takes 1 operation. Also kind of makes sense we just increase one counter.
But it is totally inconsistent with each other, because it doesn't makes sense that a = X[0] is 2 operations, and a = a + X[0] where you do more takes only 1.

Expectation Maximization coin toss examples

I've been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins 0, 1 and 2 with P0, P1 and P2 probability landing on Head when tossed. Toss coin 0, if the result is Head, toss coin 1 three times else toss coin 2 three times. The observed data produced by coin 1 and 2 is like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins A and B with PA and PB being the probability landing on Head when tossed. Each round, select one coin at random and toss it 10 times then record the results. The observed data is the toss results provided by these two coins. However, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these example work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails you find that the maximum likelihood for the probability of A heads is A/(A+B) - it might be worth you working this out in detail, because it is the building block for the M step.
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out P(Y=H|X=) = 0.6967. This means that you use weight 0.6967 for the case Y=H, which means that you increment the counts for Y=H by 0.6967 and increment the counts for X=H in coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the counts for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is a maximum likelihood of coin probabilities in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihood of the processes. (Typically we do at least know the number of the processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the below code in Python which explains the example given in your second example paper by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer : The code does work but I am certain that it is not coded optimally. I am not a Python coder normally and have started using it two weeks ago.
import numpy as np
import math
#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou* ####
def get_mn_log_likelihood(obs,probs):
""" Return the (log)likelihood of obs, given the probs"""
# Multinomial Distribution Log PMF
# ln (pdf) = multinomial coeff * product of probabilities
# ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
multinomial_coeff_denom= 0
prod_probs = 0
for x in range(0,len(obs)): # loop through state counts in each observation
multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
prod_probs = prod_probs + obs[x]*math.log(probs[x])
multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
likelihood = multinomial_coeff + prod_probs
return likelihood
# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads)=0.45
# represent the experiments
head_counts = np.array([5,9,8,4,7])
tail_counts = 10-head_counts
experiments = zip(head_counts,tail_counts)
# initialise the pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50
# E-M begins!
delta = 0.001
j = 0 # iteration counter
improvement = float('inf')
while (improvement>delta):
expectation_A = np.zeros((5,2), dtype=float)
expectation_B = np.zeros((5,2), dtype=float)
for i in range(0,len(experiments)):
e = experiments[i] # i'th experiment
ll_A = get_mn_log_likelihood(e,np.array([pA_heads[j],1-pA_heads[j]])) # loglikelihood of e given coin A
ll_B = get_mn_log_likelihood(e,np.array([pB_heads[j],1-pB_heads[j]])) # loglikelihood of e given coin B
weightA = math.exp(ll_A) / ( math.exp(ll_A) + math.exp(ll_B) ) # corresponding weight of A proportional to likelihood of A
weightB = math.exp(ll_B) / ( math.exp(ll_A) + math.exp(ll_B) ) # corresponding weight of B proportional to likelihood of B
expectation_A[i] = np.dot(weightA, e)
expectation_B[i] = np.dot(weightB, e)
pA_heads[j+1] = sum(expectation_A)[0] / sum(sum(expectation_A));
pB_heads[j+1] = sum(expectation_B)[0] / sum(sum(expectation_B));
improvement = max( abs(np.array([pA_heads[j+1],pB_heads[j+1]]) - np.array([pA_heads[j],pB_heads[j]]) ))
j = j+1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly, the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables, which indicate whether coin 1 was used or coin 2. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise for c_i2, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.

Is there a Sorting Algorithm that sorts in O(∞) permutations?

After reading this question and through the various Phone Book sorting scenarios put forth in the answer, I found the concept of the BOGO sort to be quite interesting. Certainly there is no use for this type of sorting algorithm but it did raise an interesting question in my mind-- could their be a sorting algorithm that is infinitely impossible to complete?
In other words, is there a process where one could attempt to compare and re-order a fixed set of data and can yet never achieve an actual sorted list?
This is much more of a theoretical/philosophical question than a practical one and if I was more of a mathematician I'd probably be able to prove/disprove such a possibility. Has anyone asked this question before and if so, what can be said about it?
[edit:] no deterministic process with a finite amount of state takes "O(infinity)" since the slowest it can be is to progress through all possible states. this includes sorting.
[earlier, more specific answer:]
no. for a list of size n you only have state space of size n! in which to store progress (assuming that the entire state of the sort is stored in the ordering of the elements and it really is "doing something," deterministically).
so the worst possible behaviour would cycle through all available states before terminating and take time proportional to n! (at the risk of confusing matters, there must be a single path through the state - since that is "all the state" you cannot have a process move from state X to Y, and then later from state X to Z, since that requires additional state, or is non-deterministic)
Idea 1:
function sort( int[] arr ) {
int[] sorted = quicksort( arr ); // compare and reorder data
while(true); // where'd this come from???
return sorted; // return answer
}
Idea 2
How do you define O(infinity)? The formal definition of Big-O merely states that f(x)=O(g(x)) implies that M*g(x) is an upper bound of f(x) given sufficiently large x and some constant M.
Typically when you talking about "infinity", you are talking about some sort of unbounded limit. So in this case, the only reasonable definition is saying that O(infinity) is O(function that's larger than every function). Obviously a function that's larger than every function is an upper bound. Thus technically everything is "O(infinity)"
Idea 3
Assuming you mean theta notation (tight bound)...
If you impose the additional restriction that the algorithm is smart (returns when it finds a sorted permutation) and every permutation of the list must be visited in a finite amount of time, then the answer no. There are only N! permutations of a list. The upper bound for such a sorting algorithm is then a finite over finite numbers, which is finite.
Your question doesn't really have much to do with sorting. An algorithm which is guaranteed never to complete would be pretty dull. Indeed, even an algorithm which would might or might not ever complete would be pretty dull. Much more interesting would be an algorithm which would be guaranteed to complete, eventually, but whose worst-case computation time with respect to the size of the input would not be expressible as O(F(N)) for any function F that could itself be computed in bounded time. My hunch would be that such an algorithm could be devised, but I'm not sure how.
How about this one:
Start at the first item.
Flip a coin.
If it's heads, switch it with the next item.
If it's tails, don't switch them.
If list is sorted, stop.
If not, move onto the next pair ...
It's a sorting algorithm -- the kind a monkey might do. Is there any guarantee that you'll arrive at a sorted list? I don't think so!
Yes -
SortNumbers(collectionOfNumbers)
{
If IsSorted(collectionOfNumbers){
reverse(collectionOfNumbers(1:end/2))
}
return SortNumbers(collectionOfNumbers)
}
Input: A[1..n] : n unique integers in arbitrary order
Output: A'[1..n] : reordering of the elements of A
such that A'[i] R(A') A'[j] if i < j.
Comparator: a R(A') b iff A'[i] = a, A'[j] = b and i > j
More generally, make the comparator something that's either (a) impossible to reconcile with the output specification, so that no solution can exist, or (b) uncomputable (e.g., sort these (input, turing machine) pairs in order of the number of steps needed for the machine to halt on the input).
Even more generally, if you have a procedure that fails to halt on a valid input, the procedure is not an algorithm which solves the problem on that input/output domain... which means you don't have an algorithm at all, or that what you have is only an algorithm if you appropriately restrict the domain.
Let's suppose that you have a random coin flipper, infinite arithmetic, and infinite rationals. Then the answer is yes. You can write a sorting algorithm which has 100% chance of successfully sorting your data (so it really is a sorting function), but which on average will take infinite time to do so.
Here is an emulation of this in Python.
# We'll pretend that these are true random numbers.
import random
import fractions
def flip ():
return 0.5 < random.random()
# This tests whether a number is less than an infinite precision number in the range
# [0, 1]. It has a 100% probability of returning an answer.
def number_less_than_rand (x):
high = fractions.Fraction(1, 1)
low = fractions.Fraction(0, 1)
while low < x and x < high:
if flip():
low = (low + high) / 2
else:
high = (low + high) / 2
return high < x
def slow_sort (some_array):
n = fractions.Fraction(100, 1)
# This loop has a 100% chance of finishing, but its average time to complete
# is also infinite. If you haven't studied infinite series and products, you'll
# just have to take this on faith. Otherwise proving that is a fun exercise.
while not number_less_than_rand(1/n):
n += 1
print n
some_array.sort()

Most efficient algorithm to compute a common numerator of a sum of fractions

I'm pretty sure that this is the right site for this question, but feel free to move it to some other stackexchange site if it fits there better.
Suppose you have a sum of fractions a1/d1 + a2/d2 + … + an/dn. You want to compute a common numerator and denominator, i.e., rewrite it as p/q. We have the formula
p = a1*d2*…*dn + d1*a2*d3*…*dn + … + d1*d2*…d(n-1)*an
q = d1*d2*…*dn.
What is the most efficient way to compute these things, in particular, p? You can see that if you compute it naïvely, i.e., using the formula I gave above, you compute a lot of redundant things. For example, you will compute d1*d2 n-1 times.
My first thought was to iteratively compute d1*d2, d1*d2*d3, … and dn*d(n-1), dn*d(n-1)*d(n-2), … but even this is inefficient, because you will end up computing multiplications in the "middle" twice (e.g., if n is large enough, you will compute d3*d4 twice).
I'm sure this problem could be expressed somehow using maybe some graph theory or combinatorics, but I haven't studied enough of that stuff to have a good feel for it.
And one note: I don't care about cancelation, just the most efficient way to multiply things.
UPDATE:
I should have known that people on stackoverflow would be assuming that these were numbers, but I've been so used to my use case that I forgot to mention this.
We cannot just "divide" out an from each term. The use case here is a symbolic system. Actually, I am trying to fix a function called .as_numer_denom() in the SymPy computer algebra system which presently computes this the naïve way. See the corresponding SymPy issue.
Dividing out things has some problems, which I would like to avoid. First, there is no guarantee that things will cancel. This is because mathematically, (a*b)**n != a**n*b**n in general (if a and b are positive it holds, but e.g., if a == b ==-1 and n == 1/2, you get (a*b)**n == 1**(1/2) == 1 but (-1)**(1/2)*(-1)**(1/2) == I*I == -1). So I don't think it's a good idea to assume that dividing by an will cancel it in the expression (this may be actually be unfounded, I'd need to check what the code does).
Second, I'd like to also apply a this algorithm to computing the sum of rational functions. In this case, the terms would automatically be multiplied together into a single polynomial, and "dividing" out each an would involve applying the polynomial division algorithm. You can see in this case, you really do want to compute the most efficient multiplication in the first place.
UPDATE 2:
I think my fears for cancelation of symbolic terms may be unfounded. SymPy does not cancel things like x**n*x**(m - n) automatically, but I think that any exponents that would combine through multiplication would also combine through division, so powers should be canceling.
There is an issue with constants automatically distributing across additions, like:
In [13]: 2*(x + y)*z*(S(1)/2)
Out[13]:
z⋅(2⋅x + 2⋅y)
─────────────
2
But this is first a bug and second could never be a problem (I think) because 1/2 would be split into 1 and 2 by the algorithm that gets the numerator and denominator of each term.
Nonetheless, I still want to know how to do this without "dividing out" di from each term, so that I can have an efficient algorithm for summing rational functions.
Instead of adding up n quotients in one go I would use pairwise addition of quotients.
If things cancel out in partial sums then the numbers or polynomials stay smaller, which makes computation faster.
You avoid the problem of computing the same product multiple times.
You could try to order the additions in a certain way, to make canceling more likely (maybe add quotients with small denominators first?), but I don't know if this would be worthwhile.
If you start from scratch this is simpler to implement, though I'm not sure it fits as a replacement of the problematic routine in SymPy.
Edit: To make it more explicit, I propose to compute a1/d1 + a2/d2 + … + an/dn as (…(a1/d1 + a2/d2) + … ) + an/dn.
Compute two new arrays:
The first contains partial multiples to the left: l[0] = 1, l[i] = l[i-1] * d[i]
The second contains partial multiples to the right: r[n-1] = 1, r[i] = d[i] * r[i+1]
In both cases, 1 is the multiplicative identity of whatever ring you are working in.
Then each of your terms on the top, t[i] = l[i-1] * a[i] * r[i+1]
This assumes multiplication is associative, but it need not be commutative.
As a first optimization, you don't actually have to create r as an array: you can do a first pass to calculate all the l values, and accumulate the r values during a second (backward) pass to calculate the summands. No need to actually store the r values since you use each one once, in order.
In your question you say that this computes d3*d4 twice, but it doesn't. It does multiply two different values by d4 (one a right-multiplication and the other a left-multiplication), but that's not exactly a repeated operation. Anyway, the total number of multiplications is about 4*n, vs. 2*n multiplications and n divisions for the other approach that doesn't work in non-commutative multiplication or non-field rings.
If you want to compute p in the above expression, one way to do this would be to multiply together all of the denominators (in O(n), where n is the number of fractions), letting this value be D. Then, iterate across all of the fractions and for each fraction with numerator ai and denominator di, compute ai * D / di. This last term is equal to the product of the numerator of the fraction and all of the denominators other than its own. Each of these terms can be computed in O(1) time (assuming you're using hardware multiplication, otherwise it might take longer), and you can sum them all up in O(n) time.
This gives an O(n)-time algorithm for computing the numerator and denominator of the new fraction.
It was also pointed out to me that you could manually sift out common denominators and combine those trivially without multiplication.

Minimize a function

Suppose you are given a function of a single variable and arguments a and b and are asked to find the minimum value that the function takes on the interval [a, b]. (You can assume that the argument is a double, though in my application I may need to use an arbitrary-precision library.)
In general this is a hard problem because functions can be weird. A simple version of this problem would be to minimize the function assuming that it is continuous (no gaps or jumps) and single-peaked (there is a unique minimum; to the left of the minimum the function is decreasing and to the right it is increasing). Is there a good way to solve this easier (but perhaps not easy!) problem?
Assume that the function may be difficult to calculate but not particularly expensive to store an answer that you've computed. (Obviously, it's better if you don't have to make giant arrays of key/value pairs.)
Bonus points for good ideas on improving the algorithm in the fortunate case in which it's nice (e.g.: derivative exists, function is smooth/analytic, derivative can be computed in closed form, derivative can be computed at no cost when the function is evaluated).
The version you describe, with a single minimum, is easy to solve.
The idea is this. Suppose that I have 3 points with a < b < c and f(b) < f(a) and f(b) < f(c). Then the true minimum is between a and c. Furthermore if I pick another point d somewhere in the interval, then I can throw away one of a or d and still have an interval with the true minimum in the middle. My approximations will improve exponentially quickly as I do more iterations.
We don't quite start with this. We start with 2 points, a and b, and know that the answer is somewhere in the middle. Take the mid-point. If f there is below the end points, we're into the case I discussed above. Otherwise it must be below one of the end points, and above the other. We can throw away the higher end point and repeat.
If the function is nice, i.e., single-peaked and strictly monotonic (i.e., strictly decreasing to the left of the minimum and strictly increasing to the right), then you can find the minimum with binary search:
Set x = (b-a)/2
test whether x is to the right of the minimum or to the left
if x is left of the minimum:b = x
if x is right of the minimum:a = x
repeat from start until you get bored
the minimum is at x
To test whether x is left/right of the minimum, invent a small value epsilon and check whether f(x - epsilon) < f(x + epsilon). If it is, the minimum is to the left, otherwise it's to the right. By "until you get bored", I mean: invent another small value delta and stop if fabs(f(x - epsilon) - f(x + epsilon)) < delta.
Note that in the general case where you don't know anything about the behavior of a function f, it's not possible to decide a non-trivial property of f. Well, unless you're willing to try all possible inputs. See Rice's Theorem for details.
The Boost project has an implementation of Brent's algorithm that may be useful.
It seems to assume that the function is continuous, and has no maxima (only a minimum) in the input interval.
Not a direct answer but a pointer to more reading:
scipy.optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
section e04 of naglib: http://www.nag.co.uk/numeric/cl/nagdoc_cl09/html/genint/libconts.html
For the special case where the function is differentiable twice (and the two derivatives can be calculated easily), one can use Newton's method for optimization, i.e. essentially finding the roots of the first derivative (which is a necessary condition for the minimum).
Concerning the general case, note that the extreme case of 'weird' is a function which is continuous nowhere and for which it is very hard if not impossible to find the minimum (in finite time). So I guess you should try to make at least some assumptions about the function you are trying to minimize.
What you want is to optimize an Unimodal function. The correct algorithm is similar to btilly's but you need extra points.
Take 4 points a < b < c < d.
We want to minimize f in [a,d].
If f(b) < f(c) we know the minimum is in [a, c]
If f(b) > f(c) " " " " is in [b, d]
This can give an algorithm by itself, but there is a nice trick involving the golden ratio that allows you to reuse the intermediate values (in a way you only need to compute f once per iteration instead of twice)
If you have an expression for the function, there are global optimization algorithms based on interval analysis.

Resources