Bayesian inference - probability

I have an instrument that will either pass or fail a series of three tests. The instrument must pass all three tests to be considered successful. How may I use Bayesian inference to look at the probability of passing each case based on evidence? (based on an instrument passing each past-test in turn).
Looking at just the first test - I know this from historical records of instrument tests. You can also see that each test has an acceptance boundary of -3% to +3%:
My Assumptions:
Probabilities are dependent on each other- we are looking at the same instrument over all three tests
From this historical data I see that the probability of passing test A is P(A)=0.84, so failing is P('A)=0.16
Without knowing anything about an instrument, a good assumption would be equal probabilities of passing and failing the first test. The hypothesis (H) is that the instrument passed, P(H) = 0.5; this also gives us the failed probability P('H) = 0.5.
From my understanding I need to find P(H) given the Data (D), in Bayesian terms - I would then update P(H) given the results of test A -
**P(H|D) = P(H) P(D|H) / P(D)** Where:
**P(D) = P(D|H)*P(H) + P(D|'H) P('H)**
This is where I get lost, I think this is correct:
P(H) = P('H) = 0.5 // prob of passing/failing test-A without any information
P(D|H) = 0.84 // prob of passing test-A from historical records
P('D|H) = 0.16 // prob of failing test-A from historical records
P(D) = P(D|H)*P(H) + P(D|'H) P('H) = 0.84*0.5 + 0.16*0.5
P(D) = 0.5
Giving a Bayesian value of:
P(H|D) = P(H) P(D|H) / P(D) = 0.5*0.84 / 0.5,
P(H|D) = 0.84 which is my new updated value for P(H) in test-B?
Out of interest all three tests look similar:

So there are a couple of things to take into account here. First, you are right that the a priori probabilities to use are .5 and .5 respectively, because that is how we mathematically encode not knowing what is going on. However, you are showing the three graphs independently of each other and writing Bayes' equations with only one dimension, which violates your dependence assumption. Also, there is no need to use your marginalized P(D) in this setup to get to the conditional probabilities you are asking about.
What you are really after is the conditional probability that the instrument will pass test C given how it did on test A and/or test B.
if you have only done test A then Bayes says:
P(C|A) = P(A|C)P(C)/P(A) or P(B|A) = P(A|B)P(B)/P(A)
Where A, B, and C can each take the value pass or fail.
If you have done tests A and B then you want to know the probability of passing test C which Bayes says is:
P(C|A,B) = P(A,B|C)P(C)/P(A,B)
Which looks much more complicated, but the thing is you don’t really need to do Bayesian Inference to get the conditional probabilities you are asking for:
What is my probability of passing the next test given that I have already passed or failed this test?
You have all the information you need to compute that directly. One typically uses Bayesian inference when they don’t have that luxury.
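As a quick check of this point, here is a minimal Python sketch. The `counts` table is made up purely for illustration and is not your historical data. It computes P(C|A,B) once directly from the counts and once through Bayes' rule as written above; the two routes agree, which is why the Bayesian detour is unnecessary when the joint counts are available.

```python
# Hypothetical counts of historical outcomes, keyed by (A, B, C); 1 = pass, 0 = fail.
counts = {
    (1, 1, 1): 60, (1, 1, 0): 10,
    (1, 0, 1): 8,  (1, 0, 0): 6,
    (0, 1, 1): 5,  (0, 1, 0): 4,
    (0, 0, 1): 3,  (0, 0, 0): 4,
}
total = sum(counts.values())

def prob(pred):
    """Probability of the event described by pred(a, b, c), estimated from counts."""
    return sum(n for (a, b, c), n in counts.items() if pred(a, b, c)) / total

# Direct route: P(C | A, B) = P(A, B, C) / P(A, B)
p_ab = prob(lambda a, b, c: a and b)
p_abc = prob(lambda a, b, c: a and b and c)
direct = p_abc / p_ab

# Bayes route: P(C | A, B) = P(A, B | C) P(C) / P(A, B)
p_c = prob(lambda a, b, c: c)
p_ab_given_c = p_abc / p_c
bayes = p_ab_given_c * p_c / p_ab

print(direct, bayes)
```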
To answer your question about how to calculate the probabilities that a future test will pass based upon whether or not it has already passed one or more tests think about what the values you want mean.
“Given that the instrument passed (or failed) test 1, what is the chance it will pass test 2 and test 3”
With your historical data you can answer this question directly.
Your question states that you care about the probability of pass/fail, so there are 2 possible outcomes for each test, meaning that you really only have 8 states to consider for each instrument test set:
(Number of TestA Outcomes)* (Number of TestB Outcomes)* (Number of TestC Outcomes) = 2*2*2 = 8
To calculate the probabilities you want, consider a 3D matrix, which we will call ProbabilityHistogram, with a cell for each outcome. The matrix is thus 2*2*2, indexed by whether or not each test was passed historically. We are going to use this matrix to build a histogram of historical pass/fail data and then reference that histogram to compute your probabilities of interest in the code below.
In our approach, the number of times that any instrument previously tested passed test A, failed test B, and Passed Test C would be found in ProbabilityHistogram [1,0,1], passing all three would be found in ProbabilityHistogram [1,1,1], failing all three ProbabilityHistogram [0,0,0], etc.
Here is how to calculate the values you want
Setup of Required Histogram
Start by defining a 2*2*2 matrix to hold histogram data
reading in your historical data
For every historical test you have in your data set, update the ProbabilityHistogram by using the UpdateProbHisto code below
Calculate the Probabilities of interest:
Calculate Conditional probabilities after one test using CProb_BCgA below
Calculate Conditional Probabilities after two tests using CProb_CgAB below
Code: (Sorry it is in C# because I have limited experience in Python, if you have questions just leave a comment and I'll explain further)
Set up the 3D matrix
//Define Probability Histogram
double[, ,] ProbHisto = new double[2, 2, 2];// [A Test Outcome, B Test Outcome, C Test Outcome]
Update the Histogram
//Update Histogram based on historical data.
//pass in how the instrument did on each test as one dataset
void updateProbHisto(bool APassed, bool BPassed, bool CPassed) {
    ProbHisto[Convert.ToInt32(APassed), Convert.ToInt32(BPassed), Convert.ToInt32(CPassed)]++;
}
Calculate Probabilities after one test
//Calculate the conditional probabilities that test B passes and that test C passes (separately), given A's test result
double[] CProb_BCgA(bool ATestResult) {
    //Look only at historical instruments whose A result matches this instrument's A result
    int a = Convert.ToInt32(ATestResult);
    double[] rvalue = {0.0, 0.0};//P(B|A), P(C|A)
    double BPassesGivenA = ProbHisto[a,1,0] + ProbHisto[a,1,1];
    double CPassesGivenA = ProbHisto[a,1,1] + ProbHisto[a,0,1];
    rvalue[0] = BPassesGivenA / (BPassesGivenA + ProbHisto[a,0,0] + ProbHisto[a,0,1]);// BPasses over BPasses + BFailures
    rvalue[1] = CPassesGivenA / (CPassesGivenA + ProbHisto[a,0,0] + ProbHisto[a,1,0]);// CPasses over CPasses + CFailures
    return rvalue;
}
Calculate probabilities after two tests
//Calculate the conditional probability that test C will pass, looking only at historical instruments that passed or failed tests A and B the same way this instrument did
double CProb_CgAB(bool ATestResult, bool BTestResult) {
    int a = Convert.ToInt32(ATestResult);
    int b = Convert.ToInt32(BTestResult);
    double CPassesGivenAB = ProbHisto[a, b, 1];
    double CFailsGivenAB = ProbHisto[a, b, 0];
    return CPassesGivenAB / (CPassesGivenAB + CFailsGivenAB);// P(C|A,B): CPasses over CPasses + CFailures
}
The conditional probability functions are set up assuming that you do test A, then test B, and then test C (BCgA = probabilities of B passing and of C passing given the result of test A), but it is straightforward to sub in the test result for B or C instead of the result for A; just bear in mind which index you are putting the test pass/fail data in.
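For readers who would rather work in Python, the histogram approach can be sketched as follows. This is only an illustrative translation, not a drop-in replacement: the function names (`build_histogram`, `prob_b_and_c_given_a`, `prob_c_given_ab`) and the `records` list are hypothetical, and the sample data is invented.

```python
from collections import Counter

def build_histogram(records):
    """Count how often each (A, B, C) pass/fail combination occurred."""
    return Counter((int(a), int(b), int(c)) for a, b, c in records)

def prob_b_and_c_given_a(hist, a):
    """Return (P(B passes | A = a), P(C passes | A = a))."""
    n_a = sum(hist[(a, b, c)] for b in (0, 1) for c in (0, 1))
    b_passes = hist[(a, 1, 0)] + hist[(a, 1, 1)]
    c_passes = hist[(a, 0, 1)] + hist[(a, 1, 1)]
    return b_passes / n_a, c_passes / n_a

def prob_c_given_ab(hist, a, b):
    """P(C passes | A = a, B = b), estimated from the histogram."""
    passes = hist[(a, b, 1)]
    fails = hist[(a, b, 0)]
    return passes / (passes + fails)

# Invented records, one tuple per instrument: (A passed, B passed, C passed)
records = [(1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 0)]
hist = build_histogram(records)
print(prob_b_and_c_given_a(hist, 1))
print(prob_c_given_ab(hist, 1, 1))
```

Both functions divide by a count, so they will raise `ZeroDivisionError` if the history contains no instrument with the requested A (or A and B) result; real code would need to decide how to handle that case.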

As Semicolons and Duct Tape said, I too don't think that you need P(H) at all to answer the question. To find P(C|A), i.e. the probability of passing test C given that you passed test A, all you need is P(A & C) and P(A), which seem to be already available to you. The same is the case for P(B|A).
Here's a python snippet that shows this in action. Assume that the structure experiment is a list of tests where each test is a list of three numbers corresponding to the result (1 for pass, 0 for fail) of test A, test B and test C respectively.
def prob_yx(y, x, exp):
    "P(y|x). exp is the list of past experimental runs"
    # runs that passed both test x and test y
    c_xy = [run for run in exp if run[x] and run[y]]
    # runs that passed test x
    c_x = [run for run in exp if run[x]]
    return len(c_xy) / float(len(c_x))

experiment = [
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0]
]
A = 0
B = 1
C = 2
# B given A
print(prob_yx(B, A, experiment))
# C given A
print(prob_yx(C, A, experiment))
# C given B
print(prob_yx(C, B, experiment))
This gives 0.75, 0.5 and roughly 0.667, respectively.
Hope this is helpful..


How to generate correlated Uniform[0,1] variables

(This question is related to how to generate a dataset of correlated variables with different distributions?)
In Stata, say that I create a random variable following a Uniform[0,1] distribution:
set seed 100
gen random1 = runiform()
I now want to create a second random variable that is correlated with the first (the correlation should be .75, say), but is bounded by 0 and 1. I would like this second variable to also be more-or-less Uniform[0,1]. How can I do this?
This won't be exact, but the NORTA/copula method should be pretty close and easy to implement.
The relevant citation is:
Cario, Marne C., and Barry L. Nelson. Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois, 1997.
The paper can be found here.
The general recipe to generate correlated random variables from any distribution is:
Draw two (or more) correlated variables from a joint standard normal distribution using corr2data
Calculate the univariate normal CDF of each of these variables using normal()
Apply the inverse CDF of any distribution to simulate draws from that distribution.
The third step is pretty easy with the [0,1] uniform: you don't even need it. Typically, the magnitude of the correlations you get will be less than the magnitudes of the original (normal) correlations, so it might be useful to bump those up a bit.
Stata Code for 2 uniformish variables that have a correlation of 0.75:
// Step 1
matrix C = (1, .75 \ .75, 1)
corr2data x y, n(10000) corr(C) double
corr x y, means
// Steps 2-3
replace x = normal(x)
replace y = normal(y)
// Make sure things worked
corr x y, means
stack x y, into(z) clear
lab define vars 1 "x" 2 "y"
lab val _stack vars
capture ssc install bihist
bihist z, by(_stack) density tw1(yline(-1 0 1))
If you want to improve the approximation for the uniform case, you can transform the correlations like this (see section 5 of the linked paper):
matrix C = (1,2*sin(.75*_pi/6)\2*sin(.75*_pi/6),1)
This is 0.76536686 instead of the 0.75.
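The same recipe can be sketched outside Stata. The Python below uses only the standard library and is an illustration under stated assumptions: the function names `correlated_uniforms` and `pearson` are hypothetical, the pairs are drawn at random (so, unlike `corr2data`, the sample correlation only approximates the target), and the `2*sin(r*pi/6)` adjustment from above is applied to the normal correlation.

```python
import math
import random
from statistics import NormalDist

def correlated_uniforms(n, target_corr, seed=100):
    """Draw n pairs of roughly Uniform[0,1] values with roughly the target correlation."""
    rng = random.Random(seed)
    # Adjust the normal correlation so the uniforms end up near target_corr
    rho = 2.0 * math.sin(target_corr * math.pi / 6.0)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Step 1: two standard normals with correlation rho
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Step 2: the normal CDF maps each draw to Uniform[0,1]
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs

def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

pairs = correlated_uniforms(20000, 0.75)
print(round(pearson(pairs), 3))
```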
Code for the question in the comments
The correlation matrix C written more compactly, and I am applying the transformation:
matrix C = ( 1, ///
2*sin(-.46*_pi/6), 1, ///
2*sin(.53*_pi/6), 2*sin(-.80*_pi/6), 1, ///
2*sin(0*_pi/6), 2*sin(-.41*_pi/6), 2*sin(.48*_pi/6), 1 )
corr2data v1 v2 v3 v4, n(10000) corr(C) cstorage(lower)
forvalues i=1/4 {
    replace v`i' = normal(v`i')
}

increase the performance to generate random numbers in a range with step-size

To make sure that this is not a duplicate, I have already checked this and this out.
I want to generate random numbers in a specific range including step size (not continuous distribution).
For example, I want to generate random numbers between -2 and 3 in which the step between two consecutive numbers is 0.02 (e.g. [-2 -1.98 -1.96 ... 2.96 2.98 3], so a generated number should be 2.96, not 2.95).
I have tried this:
r = (b-a).*rand(5,1) + a;
for i=1:length(r)
if r(i) >= 0
if mod(fix(r(i)),2)
if mod(fix(r(i)),2)
and it works.
There is an alternative way to do this in MATLAB:
y = datasample(-2:0.02:3,5,'Replace',false)
I want to know:
How can I make my own implementation faster (improve the
If the second method is faster (it looks faster to me), how can I
use similar implementation in C++?
Those previous answers do cover your case if you read carefully. For example, this one produces random numbers between limits with a step size of one. But let's generalize this to an arbitrary step size in case you can't figure out how to get there. There are several different ways. Here's one using randi, where we use the default step size of one and the range from one to the number of possible values as indices:
lo = 2;
hi = 3;
step = 0.02;
v = lo:step:hi;
r = v(randi(length(v),[5 1]))
If you look inside datasample (type edit datasample in your command window to view the code) you'll see that it's doing something very similar to this answer. In the case of the 'Replace' option being true see around line 135 (in R2013a at least).
If the 'Replace' option is false, as in your use of datasample above, then randperm actually needs to be used instead (see around line 159):
lo = 2;
hi = 3;
step = 0.02;
v = lo:step:hi;
r = v(randperm(length(v),51))
Because there is no replacement in this case, 51 is the maximum number of values that can be requested in a call and all values of r will be unique.
In C++ you should not use rand() if you're doing scientific computing and generating large numbers of random variates. Instead you should use a large-period random number generator such as Mersenne Twister (the default in Matlab). C++11 includes a version of this generator as part of <random>. More here in rand(). If you want something fast, you should try the Double precision SIMD-oriented Fast Mersenne Twister. You'll have to ask another question if you want to implement your code in C++.
The distribution you want is a simple transform of integers, so how about:
step = 0.02
r = randi([-2 3] / step, [5, 1]) * step;
In C++, rand() generates integers too, so it should be pretty obvious how to take a similar approach there.
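The index-based idea shared by these answers (draw random integers, then map them onto the grid) can be sketched in Python as below. The function name `sample_on_grid` and its parameters are hypothetical; the `replace` flag mirrors the with/without replacement distinction discussed for `datasample`.

```python
import random

def sample_on_grid(lo, hi, step, k, replace=True, rng=random):
    """Draw k values from lo, lo+step, ..., hi by sampling integer indices."""
    n_points = int(round((hi - lo) / step)) + 1
    if replace:
        idx = [rng.randrange(n_points) for _ in range(k)]
    else:
        # Unique indices; requires k <= n_points
        idx = rng.sample(range(n_points), k)
    # Round to suppress floating-point noise such as 2.9600000000000004
    return [round(lo + i * step, 10) for i in idx]

values = sample_on_grid(-2, 3, 0.02, 5)
print(values)
```

Because only integers are drawn at random, the same structure should carry over to C++ with an integer distribution from `<random>` driving a Mersenne Twister engine, although that port is not shown here.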

a faster way of implementing the nested loop with gamma function

I am trying to evaluate the following integral:
I can find the area for the following polynomial as follows:
pn =
-0.0250 0.0667 0.2500 -0.6000 0
First using the integration by Simpson's rule
fn=@(x) exp(polyval(pn,x));
fprintf('area evaluated by Simpsons rule : %f \n',area)
and the result is area evaluated by Simpsons rule : 11.483072
Then with the following code that evaluates the summation in the above formula with gamma function
for n=0:40;
for m=0:40;
for p=0:40;
result=result+ (b^n * c^m * d^p) / ( factorial(n)*factorial(m)*factorial(p) ) *...
gamma( (3*n+2*m+p+1)/4 ) / (-a)^( (3*n+2*m+p+1)/4 );
and this returns 11.4831. More or less the same result with the quad function. Now my question is whether or not it is possible for me to get rid of this nested loop as I will construct the cumulative distribution function so that I can get samples from this distribution using the inverse CDF transform. (for constructing the cdf I will use gammainc i.e. the incomplete gamma function instead of gamma)
I will need to sample from such densities that may have different polynomial coefficients and speed is of concern to me. I can already sample from such densities using Monte Carlo methods but I would like to see whether or not it is possible for me to use exact sampling from the density in order to speed up.
Thank you very much in advance.
There are several things one might do. The simplest is to avoid calling factorial. Instead one can use the relation that
factorial(n) = gamma(n+1)
Since gamma seems to be actually faster than a call to factorial, you can save a bit there. Even better, you can
>> timeit(@() factorial(40))
ans =
>> timeit(@() gamma(41))
ans =
>> timeit(@() gammaln(41))
ans =
Even better, one can do all 4 calls in a single call to gammaln. For example, think about what this does:
gammaln([(3*n+2*m+p+1)/4,n+1,m+1,p+1])*[1 -1 -1 -1]'
Note that this call has no problem with overflows either in case your numbers get large enough. And since gammaln is vectorized, that one call is fast. It costs little more time to compute 4 values than it does to compute one.
>> timeit(@() gammaln([15 20 40 30]))
ans =
>> timeit(@() gammaln(40))
ans =
Admittedly, if you use gammaln, you will need a call to exp at the end to recover the final result. You could do it with a single call to gamma however too. Perhaps like this:
g = gamma([(3*n+2*m+p+1)/4,n+1,m+1,p+1]);
g = g(1)/(g(2)*g(3)*g(4));
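To make the log-gamma trick concrete, here is a small Python sketch using `math.lgamma` from the standard library. The helper name `term_ratio` is hypothetical; it evaluates Gamma((3n+2m+p+1)/4) / (n! m! p!) in log space and compares the result with the direct formula for one small case.

```python
import math

def term_ratio(n, m, p):
    """Gamma((3n+2m+p+1)/4) / (n! m! p!), computed in log space."""
    k = (3 * n + 2 * m + p + 1) / 4.0
    # factorial(n) == gamma(n + 1), so each factorial becomes one lgamma term
    log_ratio = (math.lgamma(k) - math.lgamma(n + 1)
                 - math.lgamma(m + 1) - math.lgamma(p + 1))
    return math.exp(log_ratio)

# Compare with the direct formula for a small case
n, m, p = 4, 3, 2
direct = math.gamma((3 * n + 2 * m + p + 1) / 4.0) / (
    math.factorial(n) * math.factorial(m) * math.factorial(p))
print(term_ratio(n, m, p), direct)
```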
Next, you can be more creative in the inner loop on p. Rather than a full loop, coupled with a test to ignore the combinations you don't need, why not just do this?
for p=mod(n,2):2:40
That statement will select only those values of p that would have been used anyway, so now you can drop the if statement completely.
All of the above will give you what I'll guess is about a 5x speed increase in your loops. But it still has a set of nested loops. With some effort, you might be able to improve that too.
For example, rather than computing all of those factorials (or gamma functions) many times, do it ONCE. This should work:
nlim = 40;
facts = factorial(0:nlim);
gammas = gamma((0:(6*nlim+1))/4);
result = 0;
for n=0:nlim
  for m=0:nlim
    for p=mod(n,2):2:nlim
      result = result + (b.^n * c.^m * d.^p) ...
        .*gammas(3*n+2*m+p+1 + 1) ...
        ./ (facts(n+1).*facts(m+1).*facts(p+1)) ...
        ./ (-a)^( (3*n+2*m+p+1)/4 );
    end
  end
end
In my test on my machine, I find that your triply nested loops required 4.3 seconds to run. My version above produces the same result, yet required only 0.028418 seconds, a speedup of roughly 150 to 1, despite the triply nested loops.
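The caching idea itself is language independent. The Python sketch below is an illustration only: the function names are hypothetical, the coefficients are read off `pn` in the question on the assumption that a, b, c, d are its first four entries, and `nlim` is kept small. It sums only the terms written in the loop, without any constant factor the original formula may carry, and checks that the table-based version matches a version that recomputes every gamma and factorial.

```python
import math

def series_sum_naive(a, b, c, d, nlim=10):
    """Triple loop that recomputes gamma and factorial for every term."""
    total = 0.0
    for n in range(nlim + 1):
        for m in range(nlim + 1):
            for p in range(n % 2, nlim + 1, 2):  # p has the same parity as n
                k = 3 * n + 2 * m + p + 1
                total += (b ** n * c ** m * d ** p) * math.gamma(k / 4.0) / (
                    math.factorial(n) * math.factorial(m) * math.factorial(p)
                ) / (-a) ** (k / 4.0)
    return total

def series_sum_cached(a, b, c, d, nlim=10):
    """Same sum, but with lookup tables built once before the loops."""
    facts = [math.factorial(i) for i in range(nlim + 1)]
    # gammas[k] holds Gamma(k/4); index 0 is a placeholder and never used
    gammas = [None] + [math.gamma(k / 4.0) for k in range(1, 6 * nlim + 2)]
    total = 0.0
    for n in range(nlim + 1):
        for m in range(nlim + 1):
            for p in range(n % 2, nlim + 1, 2):
                k = 3 * n + 2 * m + p + 1
                total += (b ** n * c ** m * d ** p) * gammas[k] / (
                    facts[n] * facts[m] * facts[p]
                ) / (-a) ** (k / 4.0)
    return total

# Coefficients assumed from pn = [-0.0250 0.0667 0.2500 -0.6000 0]
a, b, c, d = -0.025, 0.0667, 0.25, -0.6
naive = series_sum_naive(a, b, c, d)
cached = series_sum_cached(a, b, c, d)
print(naive, cached)
```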
Well, without even making changes to your code you could install an excellent package from Tom Minka at Microsoft called lightspeed which replaces some built-in matlab functions with much faster versions. I know there's a replacement for gammaln().
You'll get nontrivial speed improvements, though I'm not sure how much, and it's straight-forward to install.

finding the best/ scale/shift between two vectors

I have two vectors: one represents a function f(x), and the other represents f(ax+b), i.e. a scaled and shifted version of f(x). I would like to find the best scale and shift factors.
*best - by means of least squares error , maximum likelihood, etc.
any ideas?
for example:
f1 = [0;0.450541598502498;0.0838213779969326;0.228976968716819;0.91333736150167;0.152378018969223;0.825816977489547;0.538342435260057;0.996134716626885;0.0781755287531837;0.442678269775446;0];
f2 = [-0.029171964726699;-0.0278570165494982;0.0331454732535324;0.187656956432487;0.358856370923984;0.449974662483267;0.391341738643094;0.244800719791534;0.111797007617227;0.0721767235173722;0.0854437239807415;0.143888234591602;0.251750993723227;0.478953530572365;0.748209818420035;0.908044924557262;0.811960826711455;0.512568916956487;0.22669198638799;0.168136111568694;0.365578085161896;0.644996661336714;0.823562159983554;0.792812945867018;0.656803251999341;0.545799498053254;0.587013303815021;0.777464637372241;0.962722388208354;0.980537136457874;0.734416947254272;0.375435649393553;0.106489547770962;0.0892376361668696;0.242467741982851;0.40610516900965;0.427497319032133;0.301874099075184;0.128396341665384;0.00246347624097456;-0.0322120242872125]
*note that f(x) may be irreversible...
For each f(x), take the absolute value of f(x) and normalize it such that it can be considered a probability mass function over its support. Calculate the expected value E[x] and variance of Var[x]. Then, we have that
E[a x + b] = a E[x] + b
Var[a x + b] = a^2 Var[x]
Use the above equations and the known values of E[x] and Var[x] to calculate a and b. Taking your values of f1 and f2 from your example, the following Octave script performs this procedure:
% Octave script
% f1, f2 are defined as given in your example
f1 = [zeros(length(f2) - length(f1), 1); f1];
save_f1 = f1; save_f2 = f2;
f1 = abs( f1 ); f2 = abs( f2 );
f1 = f1 ./ sum( f1 ); f2 = f2 ./ sum( f2 );
mean = @(x)sum(((1:length(x))' .* x));
var = @(x)sum((((1:length(x))'-mean(x)).^2) .* x);
m1 = mean(f1); m2 = mean(f2);
v1 = var(f1); v2 = var(f2)
a = sqrt( v2 / v1 ); b = m2 - a * m1;
plot( a .* (1:length( save_f1 )) + b, save_f1, ...
1:length( save_f2 ), save_f2 );
axis([0 length( save_f1 )]);
And the output is
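For readers who want to try the moment-matching idea without Octave, here is a minimal Python sketch on synthetic data. Everything in it is an assumption made for illustration: the helper names, the Gaussian test bump, and the chosen scale 2 and shift 5. Unlike the script above, it does not zero-pad the shorter vector.

```python
import math

def moments(f):
    """Mean and variance of position, treating |f| as an unnormalized pmf on 1..len(f)."""
    w = [abs(v) for v in f]
    total = sum(w)
    mean = sum(i * v for i, v in enumerate(w, start=1)) / total
    var = sum((i - mean) ** 2 * v for i, v in enumerate(w, start=1)) / total
    return mean, var

def estimate_scale_shift(f1, f2):
    """Estimate a, b such that position x in f1 corresponds to a*x + b in f2."""
    m1, v1 = moments(f1)
    m2, v2 = moments(f2)
    a = math.sqrt(v2 / v1)   # from Var[a x + b] = a^2 Var[x]
    b = m2 - a * m1          # from E[a x + b] = a E[x] + b
    return a, b

def bump(center, width, n):
    """Gaussian-shaped test signal sampled at positions 1..n."""
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(1, n + 1)]

# Synthetic check: a bump and a copy stretched by 2 and shifted by 5
f1 = bump(20.0, 3.0, 80)
f2 = bump(2.0 * 20.0 + 5.0, 2.0 * 3.0, 80)
a, b = estimate_scale_shift(f1, f2)
print(a, b)
```

On clean, well-contained signals like this the estimates should land very close to the scale and shift used to build `f2`; noisy or truncated data will generally degrade them.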
Here's a simple, effective, but perhaps somewhat naive approach.
First make sure you make a generic interpolator through both functions. That way you can evaluate both functions in between the given data points. I used a cubic-splines interpolator, since that seems general enough for the type of smooth functions you provided (and does not require additional toolboxes).
Then you evaluate the source function ("original") at a large number of points. Use this number also as a parameter in an inline function, that takes as input X, where
X = [a b]
(as in ax+b). For any input X, this inline function will compute
the function values of the target function at the same x-locations, but then scaled and offset by a and b, respectively.
The sum of the squared-differences between the resulting function values, and the ones of the source function you computed earlier.
Use this inline function in fminsearch with some initial estimate (one that you have obtained visually or via automatic means). For the example you provided, I used a few random ones, which all converged to near-optimal fits.
All of the above in code:
function s = findScaleOffset
%% initialize
f2 = [0;0.450541598502498;0.0838213779969326;0.228976968716819;0.91333736150167;0.152378018969223;0.825816977489547;0.538342435260057;0.996134716626885;0.0781755287531837;0.442678269775446;0];
f1 = [-0.029171964726699;-0.0278570165494982;0.0331454732535324;0.187656956432487;0.358856370923984;0.449974662483267;0.391341738643094;0.244800719791534;0.111797007617227;0.0721767235173722;0.0854437239807415;0.143888234591602;0.251750993723227;0.478953530572365;0.748209818420035;0.908044924557262;0.811960826711455;0.512568916956487;0.22669198638799;0.168136111568694;0.365578085161896;0.644996661336714;0.823562159983554;0.792812945867018;0.656803251999341;0.545799498053254;0.587013303815021;0.777464637372241;0.962722388208354;0.980537136457874;0.734416947254272;0.375435649393553;0.106489547770962;0.0892376361668696;0.242467741982851;0.40610516900965;0.427497319032133;0.301874099075184;0.128396341665384;0.00246347624097456;-0.0322120242872125];
figure(1), clf, hold on
h(1) = subplot(2,1,1); hold on
h(2) = subplot(2,1,2); hold on
axis([0 max(length(f1),length(f2)), min(min(f1),min(f2)),max(max(f1),max(f2))])
%% make cubic interpolators and test points
pp1 = spline(1:numel(f1), f1);
pp2 = spline(1:numel(f2), f2);
maxX = max(numel(f1), numel(f2));
N = 100 * maxX;
x2 = linspace(1, maxX, N);
y1 = ppval(pp1, x2);
%% search for parameters
s = fminsearch(@(X) sum( (y1 - ppval(pp2,X(1)*x2+X(2))).^2 ), [0 0])
%% plot results
y2 = ppval( pp2, s(1)*x2+s(2));
figure(1), hold on
subplot(2,1,2), hold on
plot(x2,y2, 'r')
legend('before', 'after')
s =
2.886234493867320e-001 3.734482822175923e-001
Note that this computes the opposite transformation from the one you generated the data with. Reversing the numbers:
>> 1/s(1)
ans =
3.464721948700991e+000 % seems pretty decent
>> -s(2)
ans =
-3.734482822175923e-001 % hmmm...rather different from 7/11!
(I'm not sure about the 7/11 value you provided; using the exact values you gave to make a plot results in a less accurate approximation to the source function...Are you sure about the 7/11?)
Accuracy can be improved by either
using a different optimizer (fmincon, fminunc, etc.)
demanding a higher accuracy from fminsearch through optimset
having more sample points in both f1 and f2 to improve the quality of the interpolations
Using a better initial estimate
Anyway, this approach is pretty general and gives nice results. It also requires no toolboxes.
It has one major drawback though -- the solution found may not be the global optimizer, e.g., the quality of the outcomes of this method could be quite sensitive to the initial estimate you provide. So, always make a (difference) plot to make sure the final solution is accurate, or if you have a large number of such things to do, compute some sort of quality factor upon which you decide to re-start the optimization with a different initial estimate.
It is of course very possible to use the results of the Fourier+Mellin transforms (as suggested by chaohuang below) as an initial estimate to this method. That might be overkill for the simple example you provide, but I can easily imagine situations where this could indeed be very useful.
For the scale factor a, you can estimate it by computing the ratio of the amplitude spectra of the two signals since the Fourier transform is invariant to shift.
Similarly, you can estimate the shift factor b by using the Mellin transform, which is scale invariant.
Here's a super simple approach to estimate the scale a that works on your example data:
a = length(f2) / length(f1)
This gives 3.4167 which is close to your stated value of 3.4. If that estimate is good enough, you can use correlation to estimate the shift.
I realize that this is not exactly what you asked, but it may be an acceptable alternative depending on the data.
Both Rody Oldenhuis and jstarr's answers are correct. I'm adding my own answer just to sum things up, and connect between them.
I've messed up Rody's code a little bit and ended up with the following:
function findScaleShift
load f1f2
x0 = [length(f1)/length(f2) 0]; %initial guess, can do better
costFunc = @(z) sum((eval_f1(z,f2,n)-f1).^2);
opt.TolFun = eps;
title('squared error')
function y = eval_f1(x,f2,n)
t = maketform('affine',[x(1) 0 x(2); 0 1 0 ; 0 0 1]');
y=imtransform(f2',t,'cubic','xdata',[1 n ],'ydata',[1 1])';
This gives zero results:
This method is accurate but exhaustive and may take some time. Another disadvantage is that it finds only a local minimum, and may give false results if the initial guess (x0) is far from the solution.
On the other hand, jstarr method gave the following results:
xopt = [ 3.49655562549115 -0.676062367063033]
which is a 10% deviation from the correct answer. It is a pretty fast solution, though not as accurate as I requested; it should still be noted.
I think that, in order to get the best results, jstarr's method should be used as an initial guess for the method proposed by Rody, giving an accurate solution.

Continuous Fourier transform on discrete data using Mathematica?

I have some periodic data, but the amount of data is not a multiple of
the period. How can I Fourier analyze this data? Example:
% Let's create some data for testing:
data = Table[N[753+919*Sin[x/623-125]], {x,1,25000}]
% I now receive this data, but have no idea that it came from the
formula above. I'm trying to reconstruct the formula just from 'data'.
% Looking at the first few non-constant terms of the Fourier series:
ListPlot[Table[Abs[Fourier[data]][[x]], {x,2,20}], PlotJoined->True,
shows an expected spike at 6 (since the number of periods is really
25000/(623*2*Pi) or about 6.38663, though we don't know this).
% Now, how do I get back 6.38663? One way is to "convolve" the data with
arbitrary multiples of Cos[x].
convolve[n_] := Sum[data[[x]]*Cos[n*x], {x,1,25000}]
% And graph the "convolution" near n=6:
Plot[convolve[n],{n,5,7}, PlotRange->All]
we see a spike roughly where expected.
% We try FindMaximum:
but the result is useless and inaccurate:
Machine precision is insufficient to achieve the requested accuracy or
Out[119]= {98.9285, {n -> 5.17881}}
because the function is very wiggly.
% By refining our interval (using visual analysis on the plots), we
finally find an interval where convolve[] doesn't wiggle too much:
Plot[convolve[n],{n,6.2831,6.2833}, PlotRange->All]
and FindMaximum works:
FindMaximum[convolve[n],{n,6.2831,6.2833}] // FortranForm
% However, this process is ugly, requires human intervention, and
computing convolve[] is REALLY slow. Is there a better way to do this?
% Looking at the Fourier series of the data, can I somehow divine the
"true" number of periods is 6.38663? Of course, the actual result
would be 6.283185, since my data fits that better (because I'm only
sampling at a finite number of points).
Based on Mathematica help for the Fourier function / Applications / Frequency Identification:
Checked on version 7
n = 25000;
data = Table[N[753 + 919*Sin[x/623 - 125]], {x, 1, n}];
pdata = data - Total[data]/Length[data];
f = Abs[Fourier[pdata]];
pos = Ordering[-f, 1][[1]]; (*the position of the first Maximal value*)
fr = Abs[Fourier[pdata Exp[2 Pi I (pos - 2) N[Range[0, n - 1]]/n],
FourierParameters -> {0, 2/n}]];
frpos = Ordering[-fr, 1][[1]];
N[(pos - 2 + 2 (frpos - 1)/n)]
returns 6.37072
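The underlying idea (locate the coarse FFT peak, then evaluate the spectrum at non-integer cycle counts around it) can be sketched in plain Python as follows. This is not the Mathematica code above: the function names are hypothetical, the search bracket [5, 7] is hard-coded on the assumption that the coarse peak is at 6 as reported in the question, and a simple two-stage grid search stands in for the zoomed transform.

```python
import cmath
import math

def spectrum_magnitude(samples, cycles):
    """|sum_k samples[k] * exp(-2*pi*i*cycles*k/N)| for a possibly non-integer cycle count."""
    n = len(samples)
    step = cmath.exp(-2j * math.pi * cycles / n)
    phasor = 1.0 + 0.0j
    acc = 0.0 + 0.0j
    for s in samples:
        acc += s * phasor
        phasor *= step  # advance the phase by one sample
    return abs(acc)

def refine_peak(samples, lo, hi, coarse=0.05, fine=0.002):
    """Two-stage grid search for the cycle count in [lo, hi] with the largest magnitude."""
    def best(start, stop, step):
        count = int(round((stop - start) / step))
        grid = [start + i * step for i in range(count + 1)]
        return max(grid, key=lambda c: spectrum_magnitude(samples, c))
    rough = best(lo, hi, coarse)
    return best(rough - coarse, rough + coarse, fine)

# Same test data as in the question, with the mean removed first
data = [753 + 919 * math.sin(x / 623 - 125) for x in range(1, 25001)]
mean = sum(data) / len(data)
pdata = [v - mean for v in data]

estimate = refine_peak(pdata, 5.0, 7.0)
print(estimate)
```

The estimate should fall near, but not exactly on, 25000/(623*2*Pi): the finite record biases the peak slightly, which is consistent with the 6.37072 reported above.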
Look for the period length using autocorrelation to get an estimate:
autocorrelate[data_, d_] :=
Plus @@ (Drop[data, d]*Drop[data, -d])/(Length[data] - d)
ListPlot[Table[{d, autocorrelate[data, d]}, {d, 0, 5000, 100}]]
A smart search for the first maximum away from d=0 may be the best estimate you can get from the available data?
(* the data *)
data = Table[N[753+919*Sin[x/623-125]], {x,1,25000}];
(* Find the position of the largest Fourier coefficient, after
removing the last half of the list (which is redundant) and the
constant term; the [[1]] is necessary because Ordering returns a list *)
f2 = Ordering[Abs[Take[Fourier[data], {2,Round[Length[data]/2+1]}]],-1][[1]]
(* Result: 6 *)
(* Directly find the least squares difference between all functions of
the form a+b*Sin[c*n-d], with intelligent starting values *)
sol = FindMinimum[Sum[((a+b*Sin[c*n-d]) - data[[n]])^2, {n,1,Length[data]}],
(* Result (using //InputForm):
The step size in the search has become less than the tolerance prescribed by
the PrecisionGoal option, but the gradient is larger than the tolerance
specified by the AccuracyGoal option. There is a possibility that the method
has stalled at a point that is not a local minimum.
{2.1375902350021628*^-19, {a -> 753., b -> -919., c -> 0.0016051364365971107,
d -> 2.477886509998064}} *)
(* Create a table of values for the resulting function to compare to 'data' *)
tab = Table[a+b*Sin[c*x-d], {x,1,Length[data]}] /. sol[[2]];
(* The maximal difference is effectively 0 *)
Max[Abs[data-tab]] // InputForm
(* Result: 7.73070496506989*^-12 *)
Although the above doesn't necessarily fully answer my question, I found it
somewhat remarkable.
Earlier, I'd tried using FindFit[] with Method -> NMinimize (which is
supposed to give a better global fit), but that didn't work well,
possibly because you can't give FindFit[] intelligent starting values.
The error I get bugs me but appears to be irrelevant.
