Implementation details of a Bayesian classifier - algorithm

I've implemented a simple Bayesian classifier, but I'm running into some overflow problems when using it on non-trivial amounts of data.
One strategy I tried in order to keep the numbers small, but still exact, was to keep reducing the numerator and denominator with the greatest common divisor for every part of the equation. This, however, only works when they have a common divisor...
Note that the problem goes both ways: when I keep the denominators and numerators separate for most of the calculation, I struggle with integer overflow; when I do most calculations on the fly using double arithmetic, I run into the limits that very small double values have (as defined by IEEE 754).
As I'm sure some of you here have implemented this algorithm before, how did you deal with these issues? I'd prefer not to pull in arbitrary precision types as they cost too much and I'm sure there exists a solution which doesn't require them.
Thanks.

Usually the way you handle this is by taking logs and using adds, and then doing an exp if you want to get back into probability space.
p1 * p2 * p3 * ... * pn = exp(log(p1) + log(p2) + log(p3) + ... + log(pn))
You avoid underflow by working in log space.
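A minimal sketch of the idea in Python (the numbers are made up purely for illustration): multiplying many small probabilities directly underflows to zero, while summing their logs stays perfectly representable.

import math

probs = [1e-5] * 100          # 100 small per-feature probabilities

naive = 1.0
for p in probs:
    naive *= p                # underflows to 0.0 long before the end

log_total = sum(math.log(p) for p in probs)   # about -1151.29, no problem to store

print(naive)       # 0.0
print(log_total)   # the product is exp(log_total); only exponentiate if you really must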

If you're classifying between two categories you can introduce the log ratio of probabilities for each category. So if:
log(Pr(cat1) / Pr(cat2)) <=> 0 # positive would favor cat1 and negative cat2
That is equal to:
log(Pr(cat1)) - log(Pr(cat2)) <=> 0
And if (as in Bayesian classifiers) the category probabilities are themselves products of probabilities given conditions:
log(Pr(cat1|cond1)) + ... <=> log(Pr(cat2|cond1)) + ...
Thus you are dealing with summation rather than multiplication, and you would need a massive data set to run into the same problem.
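A rough sketch of that two-class decision in log space (the per-feature probabilities are made up for illustration); note that no exp() is ever needed:

import math

# per-feature likelihoods for each category (hypothetical values)
feature_probs_cat1 = [0.9, 0.2, 0.4, 0.75]
feature_probs_cat2 = [0.5, 0.5, 0.5, 0.5]

log_score_1 = sum(math.log(p) for p in feature_probs_cat1)
log_score_2 = sum(math.log(p) for p in feature_probs_cat2)

# positive favours cat1, negative favours cat2 -- no exp() needed
log_ratio = log_score_1 - log_score_2
print("cat1" if log_ratio > 0 else "cat2")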


Expectation Maximization coin toss examples

I've been self-studying Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins 0, 1 and 2 with P0, P1 and P2 probability landing on Head when tossed. Toss coin 0, if the result is Head, toss coin 1 three times else toss coin 2 three times. The observed data produced by coin 1 and 2 is like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins A and B with PA and PB being the probability landing on Head when tossed. Each round, select one coin at random and toss it 10 times then record the results. The observed data is the toss results provided by these two coins. However, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these examples work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails you find that the maximum likelihood for the probability of A heads is A/(A+B) - it might be worth you working this out in detail, because it is the building block for the M step.
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out P(Y=H|X=HHH) = 0.6967 for the first block of tosses. This means that you use weight 0.6967 for the case Y=H, so you increment the count for Y=H by 0.6967 and increment the count for X=H in coin 1 by 3*0.6967, and you use weight 0.3033 for the case Y=T, so you increment the count for Y=T by 0.3033 and increment the count for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is the maximum likelihood estimate of a coin's head probability in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihood of the processes. (Typically we do at least know the number of processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data point is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
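As a concrete (if oversimplified) sketch of that soft re-optimization, here is one EM iteration for a two-component, one-dimensional Gaussian mixture with a known, shared variance; the data and all names are made up for illustration:

import math

data = [0.2, 0.5, 0.9, 4.8, 5.1, 5.4]   # made-up observations
mu = [0.0, 4.0]                          # current guesses for the two means
weight = [0.5, 0.5]                      # current guesses for the mixing proportions
sigma2 = 1.0                             # variance assumed known and shared

def gauss(x, m):
    return math.exp(-(x - m) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# E-step: soft classification -- responsibility of each component for each point
resp = []
for x in data:
    unnorm = [weight[k] * gauss(x, mu[k]) for k in range(2)]
    total = sum(unnorm)
    resp.append([u / total for u in unnorm])   # normalised so they sum to one

# M-step: re-optimise the guesses using the responsibilities as weights
for k in range(2):
    total_resp = sum(r[k] for r in resp)
    mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / total_resp
    weight[k] = total_resp / len(data)

print(mu, weight)

The hard classify-maximize variant described above would simply replace each responsibility with 0 or 1 for the most likely component.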
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the code below in Python; it works through the example given in your second paper, by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer: The code does work, but I am certain that it is not coded optimally. I am not normally a Python coder and only started using it two weeks ago.
import numpy as np
import math

#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou* ####

def get_mn_log_likelihood(obs, probs):
    """Return the (log) likelihood of obs, given the probs"""
    # Multinomial Distribution Log PMF
    # ln (pdf) = multinomial coeff * product of probabilities
    # ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
    multinomial_coeff_denom = 0
    prod_probs = 0
    for x in range(0, len(obs)):  # loop through state counts in each observation
        multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
        prod_probs = prod_probs + obs[x] * math.log(probs[x])

    multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
    likelihood = multinomial_coeff + prod_probs
    return likelihood

# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads) = 0.45

# represent the experiments
head_counts = np.array([5, 9, 8, 4, 7])
tail_counts = 10 - head_counts
experiments = list(zip(head_counts, tail_counts))  # list() so it can be indexed under Python 3

# initialise pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50

# E-M begins!
delta = 0.001
j = 0  # iteration counter
improvement = float('inf')
while improvement > delta:
    expectation_A = np.zeros((5, 2), dtype=float)
    expectation_B = np.zeros((5, 2), dtype=float)
    for i in range(0, len(experiments)):
        e = experiments[i]  # i'th experiment
        ll_A = get_mn_log_likelihood(e, np.array([pA_heads[j], 1 - pA_heads[j]]))  # log likelihood of e given coin A
        ll_B = get_mn_log_likelihood(e, np.array([pB_heads[j], 1 - pB_heads[j]]))  # log likelihood of e given coin B

        weightA = math.exp(ll_A) / (math.exp(ll_A) + math.exp(ll_B))  # weight of A, proportional to likelihood of A
        weightB = math.exp(ll_B) / (math.exp(ll_A) + math.exp(ll_B))  # weight of B, proportional to likelihood of B

        expectation_A[i] = np.dot(weightA, e)
        expectation_B[i] = np.dot(weightB, e)

    pA_heads[j + 1] = sum(expectation_A)[0] / sum(sum(expectation_A))
    pB_heads[j + 1] = sum(expectation_B)[0] / sum(sum(expectation_B))

    improvement = max(abs(np.array([pA_heads[j + 1], pB_heads[j + 1]]) - np.array([pA_heads[j], pB_heads[j]])))
    j = j + 1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly; the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables, which indicate whether coin 1 or coin 2 was used. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of the parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise p2 for samples where c_i2=1, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.
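Since the answer offers to write the updates out, here is a rough sketch of the E and M steps for the three-coins example under that formulation (the variable names and starting guesses are mine, not from the papers):

# Three-coins EM sketch: lam = P(coin 0 = H), p1/p2 = P(H) for coins 1 and 2.
# Each observation is (heads, tails) from three tosses of coin 1 or coin 2.
observations = [(3, 0), (0, 3), (3, 0), (0, 3), (3, 0)]   # HHH, TTT, HHH, TTT, HHH

lam, p1, p2 = 0.6, 0.7, 0.4   # arbitrary starting guesses

for _ in range(100):
    # E-step: expected value of c_i1 (coin 1 was used) for each observation
    exp_c1 = []
    for h, t in observations:
        lik1 = lam * p1 ** h * (1 - p1) ** t
        lik2 = (1 - lam) * p2 ** h * (1 - p2) ** t
        exp_c1.append(lik1 / (lik1 + lik2))

    # M-step: maximum-likelihood estimates treating exp_c1 as soft counts
    lam = sum(exp_c1) / len(observations)
    p1 = (sum(c * h for c, (h, t) in zip(exp_c1, observations))
          / sum(c * (h + t) for c, (h, t) in zip(exp_c1, observations)))
    p2 = (sum((1 - c) * h for c, (h, t) in zip(exp_c1, observations))
          / sum((1 - c) * (h + t) for c, (h, t) in zip(exp_c1, observations)))

print(lam, p1, p2)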

Optimal Bucket Size and No. of Buckets

Sorry this post is not related to coding but more to data structures and Algorithms.
I have a large amount of data, each value occurring with a different frequency. The approximate plot looks like a bell curve. I now want to display the data in ranges (buckets) that most precisely describe the frequency of each range.
e.g. the entire range of the data covers the total number of frequencies, but this range or bucket size is not precise and could be made more so (e.g. if some data is more concentrated in a particular frequency zone, we may build a bucket with a smaller data size but more closely related frequencies).
Any help regarding a suitable algorithm would be appreciated. I thought of an algorithm related to binary search. Any ideas, folks?
Not sure I am following, but it seems you are looking for k bins such that, for any two bins, the probability of a data point falling in one bin is the same as the probability of it falling in the other.
From your description, your data seems to be normally distributed, or T-distributed.
One can evaluate the mean and standard deviation of the data, let the extracted S.D. be s and the mean be u.
The standard formulas for evaluating the mean and S.D. from the sample are (1):
u = (x1 + x2 + ... + xn) / n (simple average)
s^2 = Sigma((xi - u)^2)/(n-1)
Given this information, you can evaluate the distribution of your data, which is N(u,s^2), and create a random variable X~N(u,s^2) (2).
Now all is left is finding the a,b,... as follows (assuming 10 buckets, this can obviously be modified as you wish):
P(X<a) = 0.1
P(X<b) = 0.2
P(X<c) = 0.3
...
After finding a,b,c,... you have your bins: (-infinity,a], (a,b], (b,c], ...
(1) evaluating variance: http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance
(2) The real distribution of this variable is actually the t-distribution, since the variance is unknown and extracted from the data. However, for large enough n, the t-distribution approaches the normal distribution.
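A short sketch of this recipe (assuming SciPy is available; the data here is randomly generated purely for illustration):

import numpy as np
from scipy.stats import norm

data = np.random.normal(loc=50, scale=10, size=10000)   # stand-in for your data

u = data.mean()
s = data.std(ddof=1)          # sample standard deviation, the (n-1) version

k = 10                        # number of buckets
# boundaries a, b, c, ... such that P(X < a) = 0.1, P(X < b) = 0.2, ...
boundaries = norm.ppf(np.arange(1, k) / k, loc=u, scale=s)

# buckets are (-inf, a], (a, b], ..., (last boundary, +inf)
counts = np.bincount(np.digitize(data, boundaries), minlength=k)
print(boundaries)
print(counts)                 # roughly equal counts per bucket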
First count all the indexes, then subtract the repeating values; this will give you the optimal number of buckets, but only at a small scale.

How to provide most relevant results with Multiple Factor Weighted Sorting

I need to provide a weighted sort on 2+ factors, ordered by "relevancy". However, the factors aren't completely isolated, in that I want one or more of the factors to affect the "urgency" (weight) of the others.
Example: contributed content (articles) can be up-/down-voted, and thus have a rating; they have a post date, and they're also tagged with categories. Users write the articles and can vote, and may or may not have some kind of ranking themselves (expert, etc). Probably similar to StackOverflow, right?
I want to provide each user with a list of articles grouped by tag but sorted by "relevancy", where relevancy is calculated based on the rating and age of the article, and possibly affected by the ranking of the author. I.e., a highly rated article that was written several years ago may not necessarily be as relevant as a medium-rated article written yesterday. And maybe if an article was written by an expert it would be treated as more relevant than one written by "Joe Schmoe".
Another good example would be assigning hotels a "meta score" comprised of price, rating, and attractions.
My question is, what is the best algorithm for multiple factor sorting? This may be a duplicate of that question, but I'm interested in a generic algorithm for any number of factors (a more reasonable expectation is 2 - 4 factors), preferably a "fully-automatic" function that I don't have to tweak or require user input, and I can't parse linear algebra and eigenvector wackiness.
Possibilities I've found so far:
Note: S is the "sorting score"
"Linearly weighted" - use a function like: S = (w1 * F1) + (w2 * F2) + (w3 * F3), where wx are arbitrarily assigned weights, and Fx are the values of the factors. You'd also want to normalize F (i.e. Fx_n = Fx / Fmax). I think this is kinda how Lucene search works.
"Base-N weighted" - more like grouping than weighting, it's just a linear weighting where weights are increasing multiples of base-10 (a similar principle to CSS selector specificity), so that more important factors are significantly higher: S = 1000 * F1 + 100 * F2 + 10 * F3 ....
Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor - the consequence being to sort on more "statistically significant" values. The link explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article), and F2 is the "significance modifying" factor ("visits" in the article).
Bayesian Estimate - looks really similar to ETV, this is how IMDb calculates their rating. See this StackOverflow post for explanation; equation: S = (F2 / (F2+F2_lim)) * F1 + (F2_lim / (F2+F2_lim)) * F1_avg, where Fx are the same as #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered).
Options #3 or #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but the problem is how do you do this for more than two factors?
I also came across the SQL implementation for a two-factor weighting algorithm, which is basically what I'll need to write eventually.
As mentioned in the comments, I would suggest what's called the 'compromise solution' to anyone with a similar problem who is more concerned with not having to set weights than with making one criterion more heavily weighted than the others.
Basically, you consider each of your criteria as a coordinate (after normalization, of course). Based on your judgement, you choose the absolute optimal point, e.g. in this case, the highest-ranked author, the newest article, etc. Once you choose the optimal solution, each other 'solution' is rated based on its distance from that optimal. A sample formula would be the inverse of the Euclidean distance for each article's score: S = 1/(sqrt((rank - rank_ideal)^2 + (age - age_ideal)^2 + ... + (xn - xn_ideal)^2)).
This treats all criteria as equal, so keep that in mind.
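A minimal sketch of the compromise solution (the field names, ranges and epsilon guard are illustrative choices, not part of the original answer):

import math

def compromise_score(article, ideal, ranges):
    """Inverse Euclidean distance from the ideal point, after normalizing each factor."""
    dist_sq = 0.0
    for key in ideal:
        norm_value = article[key] / ranges[key]
        norm_ideal = ideal[key] / ranges[key]
        dist_sq += (norm_value - norm_ideal) ** 2
    return 1.0 / (math.sqrt(dist_sq) + 1e-9)   # epsilon so the ideal point itself doesn't divide by zero

# made-up example: rating out of 100, age in days (lower is better, so ideal age is 0)
ranges = {"rating": 100.0, "age_days": 365.0, "author_rank": 10.0}
ideal = {"rating": 100.0, "age_days": 0.0, "author_rank": 10.0}

articles = [
    {"rating": 95.0, "age_days": 900.0, "author_rank": 9.0},
    {"rating": 70.0, "age_days": 1.0, "author_rank": 5.0},
]
articles.sort(key=lambda a: compromise_score(a, ideal, ranges), reverse=True)
print(articles)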
Consider chaining of the weights. E.g. you have 3 factors: X, Y and Z.
You can calculate ETVyz as W = (Z/Zmax * Y) + (1 - Z/Zmax) * Yavg for each record and then calculate ETVxw as S = (W/Wmax * X) + (1 - W/Wmax) * Xavg.
You can chain more factors similarly.
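A small sketch of that chaining with three factors X, Y, Z (all values made up):

def etv(primary, significance, primary_avg, significance_max):
    """Estimated-true-value weighting of `primary` by `significance` (as in option #3)."""
    ratio = significance / significance_max
    return ratio * primary + (1 - ratio) * primary_avg

# rows of (X, Y, Z); values are made up
rows = [(0.9, 120.0, 300.0), (0.6, 400.0, 50.0), (0.8, 80.0, 800.0)]

y_avg = sum(y for _, y, _ in rows) / len(rows)
z_max = max(z for _, _, z in rows)

# first chain Z into Y ...
w = [etv(y, z, y_avg, z_max) for _, y, z in rows]

x_avg = sum(x for x, _, _ in rows) / len(rows)
w_max = max(w)

# ... then chain the result W into X to get the final sorting score S
s = [etv(x, wi, x_avg, w_max) for (x, _, _), wi in zip(rows, w)]
print(s)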
The solution pointed out briefly by #gankoji is a simplification of the TOPSIS method.
In TOPSIS the compromise solution can be regarded as choosing the solution with the shortest Euclidean distance from the ideal solution and the farthest Euclidean distance from the negative ideal solution.
This class of problems falls under the term MCDM - Multiple Criteria Decision Making.
Python packages scikit-criteria and mcdm provide implementations of most popular methods. The package docs link to the respective algorithm papers.

Most efficient algorithm to compute a common numerator of a sum of fractions

I'm pretty sure that this is the right site for this question, but feel free to move it to some other stackexchange site if it fits there better.
Suppose you have a sum of fractions a1/d1 + a2/d2 + … + an/dn. You want to compute a common numerator and denominator, i.e., rewrite it as p/q. We have the formula
p = a1*d2*…*dn + d1*a2*d3*…*dn + … + d1*d2*…d(n-1)*an
q = d1*d2*…*dn.
What is the most efficient way to compute these things, in particular, p? You can see that if you compute it naïvely, i.e., using the formula I gave above, you compute a lot of redundant things. For example, you will compute d1*d2 n-1 times.
My first thought was to iteratively compute d1*d2, d1*d2*d3, … and dn*d(n-1), dn*d(n-1)*d(n-2), … but even this is inefficient, because you will end up computing multiplications in the "middle" twice (e.g., if n is large enough, you will compute d3*d4 twice).
I'm sure this problem could be expressed somehow using maybe some graph theory or combinatorics, but I haven't studied enough of that stuff to have a good feel for it.
And one note: I don't care about cancelation, just the most efficient way to multiply things.
UPDATE:
I should have known that people on stackoverflow would be assuming that these were numbers, but I've been so used to my use case that I forgot to mention this.
We cannot just "divide" out dn from each term. The use case here is a symbolic system. Actually, I am trying to fix a function called .as_numer_denom() in the SymPy computer algebra system which presently computes this the naïve way. See the corresponding SymPy issue.
Dividing out things has some problems, which I would like to avoid. First, there is no guarantee that things will cancel. This is because mathematically, (a*b)**n != a**n*b**n in general (it holds if a and b are positive, but e.g., if a == b == -1 and n == 1/2, you get (a*b)**n == 1**(1/2) == 1 but (-1)**(1/2)*(-1)**(1/2) == I*I == -1). So I don't think it's a good idea to assume that dividing by dn will cancel it in the expression (this may actually be unfounded; I'd need to check what the code does).
Second, I'd like to also apply this algorithm to computing the sum of rational functions. In this case, the terms would automatically be multiplied together into a single polynomial, and "dividing" out each dn would involve applying the polynomial division algorithm. You can see that in this case, you really do want to compute the most efficient multiplication in the first place.
UPDATE 2:
I think my fears for cancelation of symbolic terms may be unfounded. SymPy does not cancel things like x**n*x**(m - n) automatically, but I think that any exponents that would combine through multiplication would also combine through division, so powers should be canceling.
There is an issue with constants automatically distributing across additions, like:
In [13]: 2*(x + y)*z*(S(1)/2)
Out[13]:
z⋅(2⋅x + 2⋅y)
─────────────
2
But this is first a bug and second could never be a problem (I think) because 1/2 would be split into 1 and 2 by the algorithm that gets the numerator and denominator of each term.
Nonetheless, I still want to know how to do this without "dividing out" di from each term, so that I can have an efficient algorithm for summing rational functions.
Instead of adding up n quotients in one go I would use pairwise addition of quotients.
If things cancel out in partial sums then the numbers or polynomials stay smaller, which makes computation faster.
You avoid the problem of computing the same product multiple times.
You could try to order the additions in a certain way, to make canceling more likely (maybe add quotients with small denominators first?), but I don't know if this would be worthwhile.
If you start from scratch this is simpler to implement, though I'm not sure it fits as a replacement of the problematic routine in SymPy.
Edit: To make it more explicit, I propose to compute a1/d1 + a2/d2 + … + an/dn as (…(a1/d1 + a2/d2) + … ) + an/dn.
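For integer fractions, a minimal sketch of that pairwise scheme, reducing by the gcd after every addition (for the SymPy use case the same structure would apply with a polynomial gcd instead):

from math import gcd

def add_pair(frac1, frac2):
    """Add two fractions (a1, d1) + (a2, d2), reducing by the gcd immediately."""
    a1, d1 = frac1
    a2, d2 = frac2
    a = a1 * d2 + a2 * d1
    d = d1 * d2
    g = gcd(a, d)
    return a // g, d // g

def sum_fractions(fractions):
    total = fractions[0]
    for frac in fractions[1:]:
        total = add_pair(total, frac)
    return total

print(sum_fractions([(1, 6), (1, 10), (1, 15)]))   # (1, 3): 1/6 + 1/10 + 1/15 = 1/3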
Compute two new arrays:
The first contains partial products to the left (indexing d1, …, dn as in the formula above): l[0] = 1, l[i] = l[i-1] * d[i], so that l[i] = d[1]*…*d[i].
The second contains partial products to the right: r[n+1] = 1, r[i] = d[i] * r[i+1], so that r[i] = d[i]*…*d[n].
In both cases, 1 is the multiplicative identity of whatever ring you are working in.
Then each of your terms on the top is t[i] = l[i-1] * a[i] * r[i+1].
This assumes multiplication is associative, but it need not be commutative.
As a first optimization, you don't actually have to create r as an array: you can do a first pass to calculate all the l values, and accumulate the r values during a second (backward) pass to calculate the summands. No need to actually store the r values since you use each one once, in order.
In your question you say that this computes d3*d4 twice, but it doesn't. It does multiply two different values by d4 (one a right-multiplication and the other a left-multiplication), but that's not exactly a repeated operation. Anyway, the total number of multiplications is about 4*n, vs. 2*n multiplications and n divisions for the other approach that doesn't work in non-commutative multiplication or non-field rings.
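A sketch of this in Python with 0-based lists (only multiplications are used, and the left/right order of the factors is preserved, so it also fits the non-commutative case described above):

def common_numerator_and_denominator(a, d):
    """Given numerators a[0..n-1] and denominators d[0..n-1], return (p, q) using only multiplications."""
    n = len(d)

    # l[i] = d[0] * ... * d[i-1]  (prefix products, l[0] = 1)
    l = [1] * (n + 1)
    for i in range(n):
        l[i + 1] = l[i] * d[i]

    # accumulate the suffix products on the fly during a backward pass
    p = 0
    r = 1                      # r = d[i+1] * ... * d[n-1]
    for i in range(n - 1, -1, -1):
        p = p + l[i] * a[i] * r
        r = r * d[i]

    q = l[n]                   # d[0] * ... * d[n-1]
    return p, q

print(common_numerator_and_denominator([1, 1, 1], [2, 3, 5]))   # (31, 30): 1/2 + 1/3 + 1/5 = 31/30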
If you want to compute p in the above expression, one way to do this would be to multiply together all of the denominators (in O(n), where n is the number of fractions), letting this value be D. Then, iterate across all of the fractions and for each fraction with numerator ai and denominator di, compute ai * D / di. This last term is equal to the product of the numerator of the fraction and all of the denominators other than its own. Each of these terms can be computed in O(1) time (assuming you're using hardware multiplication, otherwise it might take longer), and you can sum them all up in O(n) time.
This gives an O(n)-time algorithm for computing the numerator and denominator of the new fraction.
It was also pointed out to me that you could manually sift out common denominators and combine those trivially without multiplication.
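For plain integers, the multiply-everything-then-divide idea above takes only a few lines (each division D // di is exact because di divides D); whether division is acceptable at all is, of course, exactly what the update to the question is worried about:

from math import prod

def numer_denom_by_division(a, d):
    D = prod(d)                                        # product of all denominators, O(n)
    p = sum(ai * (D // di) for ai, di in zip(a, d))    # exact for integers, since each di divides D
    return p, D

print(numer_denom_by_division([1, 1, 1], [2, 3, 5]))   # (31, 30)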

Best algorithm for avoiding loss of precision?

A recent homework assignment I have received asks us to take expressions which could create a loss of precision when performed in the computer, and alter them so that this loss is avoided.
Unfortunately, the directions for doing this haven't been made very clear. From watching various examples being performed, I know that there are certain methods of doing this: using Taylor series, using conjugates if square roots are involved, or finding a common denominator when two fractions are being subtracted.
However, I'm having some trouble noticing exactly when loss of precision is going to occur. So far the only thing I know for certain is that when you subtract two numbers that are close to being the same, loss of precision occurs: the significant high-order digits cancel, and what remains is dominated by round-off error.
My question is what are some other common situations I should be looking for, and what are considered 'good' methods of approaching them?
For example, here is one problem:
f(x) = tan(x) − sin(x) when x ~ 0
What is the best and worst algorithm for evaluating this out of these three choices:
(a) (1/ cos(x) − 1) sin(x),
(b) (x^3)/2
(c) tan(x)*(sin(x)^2)/(cos(x) + 1).
I understand that when x is close to zero, tan(x) and sin(x) are nearly the same. I don't understand how or why any of these algorithms are better or worse for solving the problem.
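To see the difference numerically, one can evaluate all three forms near zero and compare them against the Taylor expansion tan(x) - sin(x) = x^3/2 + x^5/8 + ..., which serves as the reference here (a quick, illustrative sketch):

import math

x = 1e-4

direct = math.tan(x) - math.sin(x)                       # naive subtraction of nearly equal numbers
a = (1 / math.cos(x) - 1) * math.sin(x)                  # (a): still contains a tiny subtraction
b = x ** 3 / 2                                           # (b): leading Taylor term only
c = math.tan(x) * math.sin(x) ** 2 / (math.cos(x) + 1)   # (c): no cancellation at all

# reference from the Taylor series tan(x) - sin(x) = x^3/2 + x^5/8 + ...
exact = x ** 3 / 2 + x ** 5 / 8

for name, value in [("direct", direct), ("(a)", a), ("(b)", b), ("(c)", c)]:
    print(name, value, abs(value - exact) / exact)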
Another rule of thumb usually used is this: When adding a long series of numbers, start adding from numbers closest to zero and end with the biggest numbers.
Explaining why this is good is a bit tricky. When you're adding small numbers to a large number, there is a chance they will be completely discarded, because they are smaller than the lowest digit in the current mantissa of the large number. Take for instance this situation:
a = 1,000,000;
do 100,000,000 time:
a += 0.01;
if 0.01 is smaller than the lowest mantissa digit, then the loop does nothing and the end result is a == 1,000,000
but if you do this like this:
a = 0;
do 100,000,000 time:
a += 0.01;
a += 1,000,000;
Then the small numbers slowly accumulate, and you're more likely to end up with something close to a == 2,000,000, which is the right answer.
This is of course an extreme example, but I hope you get the idea.
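The effect is easy to reproduce with 32-bit floats (a small sketch using numpy.float32, with the iteration count reduced so it runs quickly; note that Python's own floats are doubles, where 0.01 would not be lost at 1,000,000):

import numpy as np

big_first = np.float32(1_000_000)
for _ in range(100_000):
    big_first += np.float32(0.01)        # each 0.01 is below the float32 resolution at 1e6, so it is lost

small_first = np.float32(0)
for _ in range(100_000):
    small_first += np.float32(0.01)      # the small terms accumulate while the running sum is still small
small_first += np.float32(1_000_000)

print(big_first)     # still 1000000.0
print(small_first)   # roughly 1001000.0 (the correct answer is 1,001,000)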
I had to take a numerics class back when I was an undergrad, and it was thoroughly painful. Anyhow, IEEE 754 is the floating point standard typically implemented by modern CPUs. It's useful to understand the basics of it, as this gives you a lot of intuition about what not to do. The simplified explanation of it is that computers store floating point numbers in something like base-2 scientific notation with a fixed number of digits (bits) for the exponent and for the mantissa. This means that the larger the absolute value of a number, the less precisely it can be represented. For 32-bit floats in IEEE 754, half of the possible bit patterns represent between -1 and 1, even though numbers up to about 10^38 are representable with a 32-bit float. For values larger than 2^24 (approximately 16.7 million) a 32-bit float cannot represent all integers exactly.
What this means for you is that you generally want to avoid the following:
Having intermediate values be large when the final answer is expected to be small.
Adding/subtracting small numbers to/from large numbers. For example, if you wrote something like:
for(float index = 17000000; index < 17000001; index++) {}
This loop would never terminate because 17,000,000 + 1 is rounded down to 17,000,000.
If you had something like:
float foo = 10000000.0f - 10000000.0001f;
The value of foo would be 0, not -0.0001, because in single precision 10000000.0001 rounds to 10000000.
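The 2^24 boundary mentioned above is easy to check directly (again using numpy.float32 to get single precision):

import numpy as np

print(np.float32(16_777_216) + np.float32(1) == np.float32(16_777_216))   # True: 2^24 + 1 is not representable
print(np.float32(17_000_000) + np.float32(1) == np.float32(17_000_000))   # True: the loop above never terminates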
My question is what are some other common situations I should be looking for, and what are considered 'good' methods of approaching them?
There are several ways you can have severe or even catastrophic loss of precision.
The most important reason is that floating-point numbers have a limited number of digits, e.g. doubles have 53 bits. That means if you have "useless" digits which are not part of the solution but must be stored, you lose precision.
For example (We are using decimal types for demonstration):
2.598765000000000000000000000100 -
2.598765000000000000000000000099
The interesting part is the 100 - 99 = 1 at the end. Since 2.598765 is equal in both cases, it does not change the result, but it wastes 8 digits. Much worse, because the computer doesn't know that those digits are useless, it is forced to store them, and crams 21 zeroes after them, wasting 29 digits in all. Unfortunately there is no way to circumvent this for differences, but there are other cases, e.g. exp(x)-1, which is a function occurring very often in physics.
The exp function near 0 is almost linear, but it forces a 1 as the leading digit. So with 12 significant digits:
exp(0.001) - 1 = 1.00100050017 - 1 = 1.00050017e-3
If we instead use a function expm1(), based on the Taylor series
1 + x + x^2/2 + x^3/6 + ... - 1 = x + x^2/2 + x^3/6 + ... =: expm1(x)
we get
expm1(0.001) = 1.00050016667e-3
Much better.
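Python's math module exposes expm1 (and the companion log1p for the analogous log(1+x) case), so the difference is easy to check:

import math

x = 1e-10
print(math.exp(x) - 1)    # 1.000000082740371e-10  -- half the digits are rounding noise
print(math.expm1(x))      # 1.00000000005e-10      -- accurate to full precision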
The second problem is functions with a very steep slope, like the tangent of x near pi/2. tan(11) (11 is close to 3.5*pi) has a slope of about 50000, which means that any small deviation caused by earlier rounding errors will be amplified by a factor of 50000! Or you have singularities, e.g. if the result approaches 0/0, which means it can have any value.
In both cases you create a substitute function, simplifying the original function. It is of no use to highlight the different solution approaches, because without training you will simply not "see" the problem in the first place.
A very good book to learn and train: Forman S. Acton: Real Computing made real
Another thing to avoid is subtracting numbers that are nearly equal, as this can also lead to increased sensitivity to roundoff error. For values near 0, cos(x) will be close to 1, so 1/cos(x) - 1 is one of those subtractions that you'd like to avoid if possible, so I would say that (a) should be avoided.
