How many FLOPS for FFT? - algorithm

I would like to know how many FLOPS a Fast Fourier Transform (FFT) performs.
So, if I have a 1 dimensional array of N float numbers and I would like to calculate the FFT of this set of numbers, how many FLOPS need to be performed?
I know that this depends on the used algorithm, but what about the fastest available?
I also know that the scaling of a FFT is of the order of N*log(N) but this would not answer my question.

That depends on the implementation. Fastest does not necessarily mean the lowest FLOP count nor the highest FLOPS. Speed is often achieved by exploiting the hardware architecture rather than by lowering the FLOP count. There are too many implementations out there, so without actual code and a target architecture your question is unanswerable.
I like precomputed W matrix implementations, as I usually run the FFT many times on matrices of a single resolution, so there is no need to compute W more than once per resolution. That can cut down the FLOP count per recursion layer significantly.
For example this DFFTcc has 14 FLOP per iteration using only +,-,* operations. Assuming 1D FFT case N=8 and using basic data-type if I did not make any silly mistake:
FLOP = 8*14 + (4+4)*14 + (2+2+2+2)*14 + (1+1+1+1+1+1+1+1)*2 = 14*N*log2(N) + 2*N = 352
If you use real input/output you can lower that even further for the first/last recursion layer. But a simple FLOP count is not enough, as some operations are more complicated than others. Also, the FLOP count is not the only thing that affects speed.
Now to get the FLOPS just measure time [s] the FFT takes:
FLOPS = FLOP/time

As underlined by Spektre, the actual FLOPS (Floating Point OPerations per Second) depend on the particular hardware and implementation and higher FLOP (Floating Point OPeration) algorithms may correspond to lower FLOPS implementations, just because with such implementations you can more effectively exploit the hardware.
If you want to compute the number of floating point operations for a Decimation In Time radix-2 approach, then you can refer to the following figure:
Let N be the length of the sequence to be transformed. There is an overall number of log2(N) stages and each stage contains N/2 butterflies. Let us then consider the generic butterfly:
Let us rewrite the output of the generic butterfly as
E(i + 1) = E(i) + W * O(i)
O(i + 1) = E(i) - W * O(i)
A butterfly thus involves one complex multiplication and two complex additions. On rewriting the above equations in terms of real and imaginary parts, we have
real(E(i + 1)) = real(E(i)) + (real(W) * real(O(i)) - imag(W) * imag(O(i)))
imag(E(i + 1)) = imag(E(i)) + (real(W) * imag(O(i)) + imag(W) * real(O(i)))
real(O(i + 1)) = real(E(i)) - (real(W) * real(O(i)) - imag(W) * imag(O(i)))
imag(O(i + 1)) = imag(E(i)) - (real(W) * imag(O(i)) + imag(W) * real(O(i)))
Accordingly, we have
4 multiplications:
    real(W) * real(O(i)),
    imag(W) * imag(O(i)),
    real(W) * imag(O(i)),
    imag(W) * real(O(i)).
6 sums:
    real(W) * real(O(i)) - imag(W) * imag(O(i)) (1)
    real(W) * imag(O(i)) + imag(W) * real(O(i)) (2)
    real(E(i)) + eqn.1
    imag(E(i)) + eqn.2
    real(E(i)) - eqn.1
    imag(E(i)) - eqn.2
Therefore, the number of operations for the Decimation In Time radix 2 approach are
2N * log2(N) multiplications
3N * log2(N) additions
These operation counts may change if the multiplications are differently arranged, see Complex numbers product using only three multiplications.
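As a quick check of the counts above, here is a minimal Python sketch. The function name radix2_flop_counts is made up for illustration, and it assumes N is a power of two and that no trivial twiddle factors (such as W = 1) are skipped. It simply tallies 4 real multiplications and 6 real additions per butterfly over log2(N) stages of N/2 butterflies each.

```python
def radix2_flop_counts(n):
    """Return (real multiplications, real additions) for a radix-2 FFT of length n.

    Assumes n is a power of two and that every butterfly costs
    one complex multiplication and two complex additions.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = n.bit_length() - 1          # log2(n)
    butterflies = (n // 2) * stages      # n/2 butterflies per stage
    return 4 * butterflies, 6 * butterflies


if __name__ == "__main__":
    for n in (8, 1024):
        muls, adds = radix2_flop_counts(n)
        print(n, muls, adds)
```

For N = 1024 this gives 20480 multiplications and 30720 additions, i.e. 2N*log2(N) and 3N*log2(N).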
The same results apply to the case of Decimation in Frequency radix 2, see figure

You can estimate the achievable FLOPS from the FFTW benchmark page. It is slightly outdated but contains results for the most efficient FFT implementations.
A rough estimate seems to be about 5000 MFlops for a 3.0 GHz Intel Xeon Core Duo.

The "fastest available" is not only very processor dependent, but likely to use a completely different algorithm than my test. But I counted the flops for a bog-simple non-recursive in-place decimation-in-time radix-2 FFT taken right out of an old ACM algorithms textbook for an FFT of length 1024, and got 20480 fmuls and 30720 fadds (this was using a pre-computed twiddle factor table, thus the transcendental function computations were not included in the flop counts). But note that this code additionally used a ton of integer array index calculations, sine table lookups, and data moves that probably took a lot more of the CPU's cycles than did the FPU. Much larger FFTs would likely also incur a large amount of additional data cache misses and other memory latency penalties. It's possible in that situation to speed up the code by adding more FLOPs in exchange for reduced memory hierarchy latency penalties. So, YMMV.


Sampling from Geometric distribution in constant time

I would like to know if there is any method to sample from the geometric distribution in constant time without using log, which can be hard to approximate. Thanks.
Without relying on logarithms, there is no algorithm to sample from a geometric(p) distribution in constant expected time. Rather, on a realistic computing model, such an algorithm's expected running time must grow at least as fast as 1 + log(1/p)/w, where w is the word size of the computer in bits (Bringmann and Friedrich 2013). The following algorithm, which is equivalent to one in the Bringmann paper, generates a geometric(px/py) random number without relying on logarithms, and when px/py is very small, the algorithm is considerably faster than the trivial algorithm of generating trials until a success:
1. Set pn to px, k to 0, and d to 0.
2. While pn*2 <= py, add 1 to k and multiply pn by 2.
3. With probability (1 − px/py)^(2^k), add 1 to d and repeat this step.
4. Generate a uniform random integer in [0, 2^k), call it m, then with probability (1 − px/py)^m, return d*2^k + m. Otherwise, repeat this step.
(The actual algorithm described in the Bringmann paper is in fact much more involved than this; see my note "On a Geometric Sampler".)
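The steps above can be sketched in Python as follows. This is only an illustration, not the exact method from the paper: the function name geometric is made up, and the two "with probability ..." tests are approximated with floating-point powers rather than sampled exactly, so the result is only as accurate as double precision allows. The returned value counts failures before the first success.

```python
import random


def geometric(px, py, rng=random):
    """Sample the number of failures before the first success, success probability px/py.

    Floating-point sketch: the Bernoulli tests use q ** n computed in
    double precision instead of an exact sampler.
    """
    pn, k, d = px, 0, 0
    while pn * 2 <= py:            # step 2: find k with 2^k roughly 1/p
        k += 1
        pn *= 2
    q = 1.0 - px / py              # failure probability of one trial
    block = 1 << k                 # 2^k
    q_block = q ** block           # chance that a whole block of 2^k trials fails
    while rng.random() < q_block:  # step 3: count fully failed blocks
        d += 1
    while True:                    # step 4: position inside the last block
        m = rng.randrange(block)
        if rng.random() < q ** m:
            return d * block + m


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [geometric(1, 50, rng) for _ in range(10000)]
    print(sum(draws) / len(draws))  # should be near (1 - p) / p = 49
```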
REFERENCES:
Bringmann, K., and Friedrich, T., 2013, July. Exact and efficient generation of geometric random variates and random graphs, in International Colloquium on Automata, Languages, and Programming (pp. 267-278).

Efficient way to multiply a large set of small numbers

This question was asked in an interview.
You have an array of small integers. You have to multiply all of them. You need not worry about overflow you have ample support for that. What can you do to speed up the multiplication on your machine?
Would multiple additions be better in this case?
I suggested multiplying using a divide and conquer approach but the interviewer was not impressed. What could be the best possible solution for this?
Here are some thoughts:
Divide-and-Conquer with Multithreading: Split the input of n numbers into n / b blocks of size b and recursively multiply all the numbers in each block together. Then, recursively multiply all n / b block products back together. If you have multiple cores and can run parts of this in parallel, you could save a lot of time overall.
Word-Level Parallelism: Let's suppose that your numbers are all bounded from above by some number U, which happens to be a power of two. Now, suppose that you want to multiply together a, b, c, and d. Start off by computing (4U^2 a + b) × (4U^2 c + d) = 16U^4 ac + 4U^2 ad + 4U^2 bc + bd. Now, notice that this expression mod U^2 is just bd. (Since bd < U^2, we don't need to worry about the mod U^2 step messing it up). This means that if we compute this product and take it mod U^2, we get back bd. Since U^2 is a power of two, this can be done with a bitmask.
Next, notice that
4U^2 ad + 4U^2 bc + bd < 4U^4 + 4U^4 + U^2 < 9U^4 < 16U^4
This means that if we divide the entire expression by 16U^4 and round down, we will end up getting back just ac. This division can be done with a bitshift, since 16U^4 is a power of two.
Consequently, with one multiplication, you can get back the values of both ac and bd by applying a subsequent bitshift and bitmask. Once you have ac and bd, you can directly multiply them together to get back the value of abcd. Assuming that bitmasks and bitshifts are faster than multiplies, this reduces the number of multiplications necessary by 33% (two instead of three here).
Hope this helps!
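Here is a minimal Python sketch of the word-level trick, with a hypothetical helper name packed_products and the bound given as a bit width (U = 2^bits). Python integers are unbounded, so this only illustrates the arithmetic; on real hardware the packed product has to fit in one machine word.

```python
def packed_products(a, b, c, d, bits):
    """Return (a*c, b*d) using one multiplication, assuming 0 <= a, b, c, d < 2**bits."""
    shift = 2 * bits + 2            # 4 * U^2 == 1 << shift
    x = (a << shift) | b            # 4U^2 * a + b
    y = (c << shift) | d            # 4U^2 * c + d
    p = x * y                       # 16U^4 ac + 4U^2 (ad + bc) + bd
    bd = p & ((1 << (2 * bits)) - 1)    # p mod U^2, done with a bitmask
    ac = p >> (2 * shift)               # floor(p / 16U^4), done with a shift
    return ac, bd


if __name__ == "__main__":
    ac, bd = packed_products(5, 7, 6, 3, bits=3)
    print(ac, bd, ac * bd)          # abcd is then one more multiplication
```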
Your divide and conquer suggestion was a good start. It just needed more explanation to impress.
With fast multiplication algorithms used to multiply large numbers (big-ints), it is much more efficient to multiply similar sized multiplicands than a series of mismatched sizes.
Here's an example in Clojure
; Define a vector of 100K random integers between 2 and 100 inclusive
(def xs (vec (repeatedly 100000 #(+ 2 (rand-int 99)))))
; Naive multiplication accumulating linearly through the array
(time (def a1 (apply *' xs)))
"Elapsed time: 7116.960557 msecs"
; Simple Divide and conquer algorithm
(defn d-c [v]
  (let [m (quot (count v) 2)]
    (if (< m 3)
      (reduce *' v)
      (*' (d-c (subvec v 0 m)) (d-c (subvec v m))))))
; Takes less than 1/10th the time.
(time (def a2 (d-c xs)))
"Elapsed time: 600.934268 msecs"
(= a1 a2) ;=> true (same result)
Note that this improvement does not rely on a set limit for the size of the integers in the array (100 was chosen arbitrarily and to demonstrate the next algorithm), but only on their being similar in size. This is a very simple divide and conquer. As the numbers get larger and more expensive to multiply, it would make sense to invest more time in iteratively grouping them by similar size. Here I am relying on random distribution and chance that the sizes will stay similar, but it is still going to be significantly better than the naive approach even for the worst case.
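For readers who do not use Clojure, roughly the same divide and conquer can be sketched in Python, whose integers are arbitrary precision. The name product_dc and the cutoff of 4 elements are arbitrary choices for illustration; timings will differ from the Clojure figures above.

```python
from functools import reduce
import operator


def product_dc(values):
    """Multiply all values by splitting the list in halves, so operands stay similar in size."""
    n = len(values)
    if n <= 4:                                  # small base case: multiply linearly
        return reduce(operator.mul, values, 1)
    mid = n // 2
    return product_dc(values[:mid]) * product_dc(values[mid:])


if __name__ == "__main__":
    import random
    xs = [random.randint(2, 100) for _ in range(100000)]
    print(product_dc(xs).bit_length())
```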
As suggested by Evgeny Kluev in the comments, for a large number of small integers, there is going to be a lot of duplication, so efficient exponentiation is also better than naive multiplication. This depends a lot more on the relative parameters than the divide and conquer, that is the numbers must be sufficiently small relative to the count for enough duplicates to accumulate to bother, but certainly performs well with these parameters (100K numbers in the range 2-100).
; Hopefully an efficient implementation
(defn pow [x n] (.pow (biginteger x) ^Integer n))
; Perform pow on duplications based on frequencies
(defn exp-reduce [v] (reduce *' (map (partial apply pow) (frequencies v))))
(time (def a3 (exp-reduce xs)))
"Elapsed time: 650.211789 msecs"
Note that the very simple divide and conquer performed just a wee bit better in this trial, and its relative advantage would be even larger if fewer duplicates were expected.
Of course we can also combine the two:
(defn exp-d-c [v] (d-c (mapv (partial apply pow) (frequencies v))))
(time (def a4 (exp-d-c xs)))
"Elapsed time: 494.394641 msecs"
(= a1 a2 a3 a4) ;=> true (all equal)
Note there are better ways to combine these two since the result of the exponentiation step is going to result in various sizes of multiplicands. The value of added complexity to do so depends on the expected number of distinct numbers in the input. In this case, there are very few distinct numbers so it wouldn't pay to add much complexity.
Note also that both of these are easily parallelized if multiple cores are available.
If many of the small integers occur multiple times, you could start by counting every unique integer. If c(n) is the number of occurrences of integer n, the product can be computed as
P = 2 ^ c(2) * 3 ^ c(3) * 4 ^ c(4) * ...
For the exponentiation steps, you can use exponentiation by squaring which can reduce the number of multiplications considerably.
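A minimal Python sketch of this counting idea follows. The name product_by_counts is made up, and Python's built-in pow stands in for a hand-written exponentiation by squaring.

```python
from collections import Counter


def product_by_counts(values):
    """Compute the product as value ** count over the distinct values."""
    result = 1
    for value, count in Counter(values).items():
        result *= pow(value, count)     # c(n) occurrences of n contribute n ^ c(n)
    return result


if __name__ == "__main__":
    print(product_by_counts([2, 3, 2, 2, 3, 5]))  # 2^3 * 3^2 * 5
```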
If the count of numbers really is large compared to the range, then we have seen two asymptotic solutions presented to reduce the complexity considerably. One was based on successive squaring to compute c^k in O(log k) time for each number c, giving O(C mean(log k)) time if the largest number is C and k gives the exponent for each number between 1 and C. The mean(log k) term is maximized if every number appears an equal number of times, so if you have N numbers then the complexity becomes O(C log(N/C)), which is very weakly dependent on N and essentially just O(C) where C specifies the range of numbers.
The other approach we saw was sorting numbers by the number of times they appear, and keeping track of the product of leading numbers (starting with all numbers) and raising this to a power so that the least frequent number is removed from the array, and then updating the exponents on the remaining element in the array and repeating. If all numbers occur the same number of times K, then this gives O(C + log K) which is an improvement over O(C log K). However, say the kth number appears 2^k times. Then this will still give O(C^2 + C log(N/C)) time which is technically worse than the previous method O(C log(N/C)) if C > log(N/C). Thus, if you don't have good information on how evenly distributed the occurrences of each number are, you should go with the first approach, just take the appropriate power of each distinct number that appears in the product by using successive squaring, and take the product of the results. Total time O(C log (N/C)) if there are C distinct numbers and N total numbers.
To answer this question we need to interpret the OP's assumption "need not worry about overflow" in some way. In the larger part of this answer it is interpreted as "ignore overflows". But I start with several thoughts about the other interpretation ("use multiprecision arithmetic"). In this case the multiplication process may be split approximately into 3 stages:
1. Multiply together small sets of small numbers to get a large set of not-so-small numbers. Some of the ideas from the second part of this answer may be used here.
2. Multiply together these numbers to get a set of large numbers. Either the trivial (quadratic time) algorithm or Toom–Cook/Karatsuba (sub-quadratic time) methods may be used.
3. Multiply together the large numbers. Either Fürer's or the Schönhage–Strassen algorithm may be used, which gives O(N polylog N) time complexity for the whole process.
Binary exponentiation may give some (not very significant) speed improvement, because most (if not all) of the sophisticated multiplication algorithms mentioned here do squaring faster than multiplication of two unequal numbers. Also we could factorize every "small" number and use binary exponentiation only for prime factors. For evenly distributed "small" numbers this will decrease the number of exponentiations by a factor of log(number_of_values) and slightly improve the balance of squarings/multiplications.
Divide and conquer is OK when the numbers are evenly distributed. Otherwise (for example when the input array is sorted or when binary exponentiation is used) we could do better by placing all multiplicands into a priority queue, ordered (maybe only approximately) by number length. Then we could multiply the two shortest numbers and place the result back into the queue (this process is very similar to Huffman encoding). There is no need to use this queue for squaring. Also we should not use it while the numbers are not long enough.
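The priority-queue idea can be sketched in Python with heapq, ordering by bit length. The name product_shortest_first is hypothetical, and this sketch applies the queue from the start for simplicity, even though the text above suggests using it only once the numbers are long enough.

```python
import heapq


def product_shortest_first(values):
    """Repeatedly multiply the two shortest numbers, Huffman-style."""
    heap = [(v.bit_length(), v) for v in values]   # key: length of the number in bits
    if not heap:
        return 1
    heapq.heapify(heap)
    while len(heap) > 1:
        _, a = heapq.heappop(heap)
        _, b = heapq.heappop(heap)
        p = a * b
        heapq.heappush(heap, (p.bit_length(), p))
    return heap[0][1]


if __name__ == "__main__":
    print(product_shortest_first([2, 3 ** 50, 5, 7 ** 20, 11]))
```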
More information on this could be found in the answer by A. Webb.
If overflows may be ignored we could multiply the numbers with linear-time or better algorithms.
Sub-linear time algorithm is possible if input array is sorted or input data is presented as set of tuples {value, number of occurrences}. In latter case we could perform binary exponentiation of each value and multiply results together. Time complexity is O(C log(N/C)), where C is number of different values in the array. (See also this answer).
If the input array is sorted, we could use binary search to find the positions where the value changes. This lets us determine how many times each value occurs in the array. Then we could perform binary exponentiation of each value and multiply the results together. Time complexity is O(C log N). We could do better by using one-sided binary search here. In this case time complexity is O(C log(N/C)).
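A minimal sketch of the sorted-input case in Python. It uses plain binary search (bisect_right), so it corresponds to the O(C log N) variant rather than the one-sided search. The name product_sorted is made up, and "ignoring overflow" is modelled here by reducing modulo 2^64, which is an assumption about the word size.

```python
from bisect import bisect_right


def product_sorted(values, modulus=1 << 64):
    """Product of a sorted list, wrapped modulo `modulus`, touching O(C log N) elements."""
    result = 1
    i, n = 0, len(values)
    while i < n:
        v = values[i]
        j = bisect_right(values, v, i, n)       # first index past the run of v
        result = result * pow(v, j - i, modulus) % modulus   # binary exponentiation
        i = j
    return result


if __name__ == "__main__":
    data = sorted([3] * 1000 + [5] * 500 + [7] * 20)
    print(product_sorted(data))
```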
But if input array is not sorted, we have to inspect each element, so O(N) time complexity is the best we can do. Still we could use parallelism (multithreading, SIMD, word-level parallelism) to obtain some speed improvement. Here I compare several such approaches.
To compare these approaches I've chosen very small (3-bit) values, which are pretty tightly packed (one value per 8-bit integer). And implemented them in low-level language (C++11) to get easier access to bit manipulation, specific CPU instructions, and SIMD.
Here are all the algorithms:
1. accumulate from the standard library.
2. Trivial implementation with 4 accumulators.
3. Word-level parallelism for multiplication, as described in the answer by templatetypedef. With a 64-bit word size this approach allows up to 10-bit values (needing only 3 multiplications out of every 4), or it may be applied twice (as I did in the tests) with up to 5-bit values (needing only 5 multiplications out of every 8).
4. Table lookup. In the tests, 7 multiplications out of every 8 are replaced by a single table lookup. With values larger than in these tests, the number of replaced multiplications decreases, slowing down the algorithm. Values larger than 11-12 bits make this approach useless.
5. Binary exponentiation (see more details below). Values larger than 4 bits make this approach useless.
6. SIMD (AVX2). This implementation can use up to 8-bit values.
Here are sources for all tests on Ideone. Note that the SIMD test requires the AVX2 instruction set from Intel. The table lookup test requires the BMI2 instruction set. The other tests do not depend on any particular hardware (I hope). I ran these tests on 64-bit Linux, compiled with gcc 4.8.1, optimization level -O2.
Here are some more details for binary exponentiation test:
for (size_t i = 0; i < size / 8; i += 2)
{
    // Pack two qwords (one 3-bit value per byte) into one: two values per byte.
    auto compr = (qwords[i] << 4) | qwords[i + 1];
    constexpr uint64_t lsb = 0x1111111111111111;
    if ((compr & lsb) != lsb) // if there is at least one even value
    {
        // Fall back to the library algorithm; acc1 is passed as the initial value.
        auto b = reinterpret_cast<uint8_t*>(qwords + i);
        acc1 = accumulate(b, b + 16, acc1, multiplies<unsigned>{});
        if (!acc1)
            break;
    }
    else
    {
        const auto b2 = compr & 0x2222222222222222;
        const auto b4 = compr & 0x4444444444444444;
        const auto b24 = b4 & (b2 * 2);
        const unsigned c7 = __builtin_popcountll(b24);
        acc3 += __builtin_popcountll(b2) - c7;
        acc5 += __builtin_popcountll(b4) - c7;
        acc7 += c7;
    }
}
const auto prod4 = acc1 * bexp<3>(acc3) * bexp<5>(acc5) * bexp<7>(acc7);
This code packs values even more densely than in the input array: two values per byte. The low-order bit of each value is handled differently: since we can stop as soon as 32 zero bits are found here (the 32-bit result is then zero), this case cannot alter performance very much, so it is handled by the simplest (library) algorithm.
Of the 4 remaining (odd) values, "1" is not interesting, so we only need to count the occurrences of "3", "5", and "7" with bitwise manipulations and the "population count" intrinsic.
Here are the results:
source size:       4 Mb:           400 Mb:
1. accumulate:     0.835392 ns     0.849199 ns
2. accum * 4:      0.290373 ns     0.286915 ns
3. 2 mul in 1:     0.178556 ns     0.182606 ns
4. mult table:     0.130707 ns     0.176102 ns
5. binary exp:     0.128484 ns     0.119241 ns
6. AVX2:           0.0607049 ns    0.0683234 ns
Here we can see that accumulate library algorithm is pretty slow: for some reason gcc could not unroll the loop and use 4 independent accumulators.
It is not too difficult to do this optimization "by hand". The result is not particularly fast. But if we allocate 4 threads for this task, CPU would approximately match memory bandwidth (2 channels, DDR3-1600).
Word-level parallelism for multiplications is almost twice as fast. (We'll need only 3 threads to match memory bandwidth).
Table lookup is even faster. But its performance degrades when input array cannot fit in L3 cache. (We'll need 3 threads to match memory bandwidth).
Binary exponentiation has approximately the same speed. But with larger inputs this performance does not degrade, it even slightly improves because exponentiation itself uses less time compared to value counting. (We'll need only 2 threads to match memory bandwidth).
As expected, SIMD is the fastest. Its performance slightly degrades when input array cannot fit in L3 cache. Which means we are close to memory bandwidth with single thread.
I have one solution. Let us discuss it with other solutions.
The key part of the question is how to reduce the number of multiplications. The integers are small but the set is big.
My solution:
1. Use a small array to record how many times each number appears.
2. Remove the number 1 from the array. You don't need to count it.
3. Find the number with the smallest occurrence count n. Then multiply all the numbers in the array together to get the result K. Then compute K^n.
4. Remove this number (for instance, you can swap it with the last number of the array and reduce the size of the array by 1), so you won't consider it any more. At the same time, the occurrence counts of the other numbers need to be reduced by the count of the removed number.
5. Once again find the number that appears the fewest times. Do the same thing as in step 3.
6. Repeat steps 3-4 until the computation is complete.
Let me use an example to show how many multiplications we need. Assume we have 5 numbers [1, 2, 3, 4, 5]. Number 1 appears 100 times, number 2 appears 150 times, number 3 appears 200 times, number 4 appears 300 times, and number 5 appears 400 times.
method 1: multiply directly or use divide and conquer
we need 100+150+200+300+400-1 = 1149 multiplications to get the result.
method 2: we compute (1^100)(2^150)(3^200)(4^300)(5^400)
(100-1)+(150-1)+(200-1)+(300-1)+(400-1)+4 = 1149 [same as method 1]
This is because computing n^m naively takes m-1 multiplications. You also need time to go through all the numbers, though this time is short.
method in this post:
First, you need time to go through all the numbers. It is negligible compared to the time spent multiplying.
The actual computation is:
((2*3*4*5)^150)*((3*4*5)^50)*((4*5)^100)*(5^100)
This takes 3+149+2+49+1+99+99+3 = 405 multiplications.
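The grouping used in this example can be sketched in Python as follows. The name product_by_peeling is made up. Note that Python's ** operator already uses exponentiation by squaring, so the multiplication counts above (which assume m-1 multiplications for n^m) do not apply to this sketch; it only shows how the factors are regrouped.

```python
from collections import Counter


def product_by_peeling(values):
    """Regroup the product as (all values)^c1 * (values without the rarest)^(c2 - c1) * ..."""
    counts = Counter(v for v in values if v != 1)          # the number 1 never matters
    items = sorted(counts.items(), key=lambda kv: kv[1])   # rarest value first
    # suffix[i] = product of the distinct values from position i onwards
    suffix = [1] * (len(items) + 1)
    for i in range(len(items) - 1, -1, -1):
        suffix[i] = suffix[i + 1] * items[i][0]
    result, used = 1, 0
    for i, (_, count) in enumerate(items):
        result *= suffix[i] ** (count - used)   # exponent left over for this group
        used = count
    return result


if __name__ == "__main__":
    data = [1] * 100 + [2] * 150 + [3] * 200 + [4] * 300 + [5] * 400
    print(product_by_peeling(data) == 2**150 * 3**200 * 4**300 * 5**400)
```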

Why is x \ y so much slower than (x' * x) \ (x' * y)?

For an NxP matrix x and an Nx1 vector y with N > P, the two expressions
x \ y -- (1)
and
(x' * x) \ (x' * y) -- (2)
both compute the solution b to the matrix equation
x * b = y
in the least squares sense, i.e. so that the quantity
norm(y - x * b)
is minimized. Expression (2) does it using the classic algorithm for the solution of an ordinary least squares regression, where the left-hand argument to the \ operator is square. It is equivalent to writing
inv(x' * x) * (x' * y) -- (3)
but it uses an algorithm which is more numerically stable. It turns out that (3) is moderately faster than (2) even though (2) doesn't have to produce the inverse matrix as a byproduct, but I can accept that given the additional numerical stability.
However, some simple timings (with N=100,000 and P=30) show that expression (2) is more than 5 times faster than expression (1), even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of X, and in the case N>P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer.
What is happening in expression (1) that is causing it to take so much longer?
Edit: Here are my timings
x = randn(1e5, 30);
y = randn(1e5,1);
tic, for i = 1:100; x\y; end; t1=toc;
tic, for i = 1:100; (x'*x)\(x'*y); end; t2=toc;
assert( abs(norm(x\y) - norm((x'*x)\(x'*y))) < 1e-10 );
fprintf('Speedup: %.2f\n', t1/t2)
Speedup: 5.23
Note that in your test
size(x) == [1e5 30] but size(x'*x) == [30 30]
size(y) == [1e5 1] but size(x'*y) == [30 1]
That means that the matrices entering the mldivide function differ in size by 4 orders of magnitude! This would render any overhead of determining which algorithm to use rather large and significant (and perhaps also running the same algorithm on the two different problems).
In other words, you have a biased test. To make a fair test, use something like
x = randn(1e3);
y = randn(1e3,1);
I find (worst of 5 runs):
Speedup: 1.06 %// R2010a
Speedup: 1.16 %// R2010b
Speedup: 0.97 %// R2013a
...the difference has all but evaporated.
But, this does show very well that if you indeed have a regression problem with low dimensionality compared to the number of observations, it really pays off to do the multiplication first :)
mldivide is a catch-all, and really great at that. But often, having knowledge about the problem may make more specific solutions, like pre-multiplication, pre-conditioning, lu, qr, linsolve, etc. orders of magnitude faster.
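To make the "do the multiplication first" idea concrete outside MATLAB, here is a small pure-Python sketch of expression (2). It forms the P-by-P matrix x'*x and the vector x'*y, then solves the small system by Gaussian elimination with partial pivoting. The function name normal_equations is made up, and this is not what mldivide does internally; it only illustrates why the remaining solve is tiny when N is much larger than P.

```python
def normal_equations(x, y):
    """Least-squares solution b of x*b = y via (x'*x) b = x'*y.

    x is a list of N rows with P entries each, y is a list of N numbers.
    """
    n, p = len(x), len(x[0])
    # Augmented P x (P+1) system [x'*x | x'*y]; building it costs O(N * P^2).
    a = [[sum(x[r][i] * x[r][j] for r in range(n)) for j in range(p)]
         + [sum(x[r][i] * y[r] for r in range(n))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting on the small system, O(P^3).
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, p):
            factor = a[r][col] / a[col][col]
            for c in range(col, p + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        rest = sum(a[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (a[i][p] - rest) / a[i][i]
    return b


if __name__ == "__main__":
    xs = [[1.0, float(t)] for t in range(4)]   # columns: intercept, t
    ys = [1.0, 3.0, 5.0, 7.0]                  # y = 1 + 2 t exactly
    print(normal_equations(xs, ys))
```

Forming x'*x squares the condition number of the problem, so this shortcut trades some numerical robustness for speed.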
"even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of X, and in the case N>P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer."
This is not the case. It could take a lot of overhead to choose which algorithm to use, particularly when compared to the computation on relatively small inputs such as these. In this case, because MATLAB can see that you have x'*x, it knows that one of the arguments must be both square and symmetric (yes - that knowledge of linear algebra is built in to MATLAB even at a parser level), and can straight away call one of the appropriate code paths within \.
I can't say whether this fully explains the timing differences you're seeing. I would want to investigate further, at least by:
Making sure to put the code within a function, and warming the function up to ensure that the JIT is engaged - and then trying the same thing with feature('accel', 'off') to remove the effect of the JIT
Trying this on a much bigger range of input sizes to check what contribution an 'algorithm choice overhead' made compared to computation time.

Segmented Sieve of Atkin, possible?

I am aware of the fact that the Sieve of Eratosthenes can be implemented so that it finds primes continuously without an upper bound (the segmented sieve).
My question is, could the Sieve of Atkin/Bernstein be implemented in the same way?
Related question: C#: How to make Sieve of Atkin incremental
However the related question has only 1 answer, which says "It's impossible for all sieves", which is obviously incorrect.
Atkin/Bernstein give a segmented version in Section 5 of their original paper. Presumably Bernstein's primegen program uses that method.
In fact, one can implement an unbounded Sieve of Atkin (SoA) not using segmentation at all as I have done here in F#. Note that this is a pure functional version that doesn't even use arrays to combine the solutions of the quadratic equations and the square-free filter and thus is considerably slower than a more imperative approach.
Bernstein's optimizations using look up tables for optimum 32-bit ranges would make the code extremely complex and not suitable for presentation here, but it would be quite easy to adapt my F# code so that the sequences start at a set lower limit and are used only over a range in order to implement a segmented version, and/or to apply the same techniques to a more imperative approach using arrays.
Note that even Bernstein's implementation of the SoA isn't really faster than the Sieve of Eratosthenes with all possible optimizations as per Kim Walisch's primesieve but is only faster than an equivalently optimized version of the Sieve of Eratosthenes for the selected range of numbers as per his implementation.
EDIT_ADD: For those who do not want to wade through Bernstein's pseudo-code and C code, I am adding to this answer a pseudo-code method to use the SoA over a range from LOW to HIGH where the delta from LOW to HIGH + 1 might be constrained to an even modulo 60 in order to use the modulo (and potential bit packing to only the entries on the 2,3,5 wheel) optimizations.
This is based on a possible implementation using the SoA quadratics of (4*x^2 + y^2), (3*x^2 + y^2), and (3*x^2 - y^2) to be expressed as sequences of numbers with the x value for each sequence fixed to values between one and SQRT((HIGH - 1) / 4), SQRT((HIGH - 1) / 3), and solving the quadratic for 2*x^2 + 2*x - HIGH - 1 = 0 for x = (SQRT(1 + 2 * (HIGH + 1)) - 1) / 2, respectively, with the sequences expressed in my F# code as per the top link. Optimizations to the sequences there use that when sieving for only odd composites, for the "4x" sequences, the y values need only be odd and that the "3x" sequences need only use odd values of y when x is even and vice versa. Further optimizations reduce the number of solutions to the quadratic equations (= elements in the sequences) by observing that the modulo patterns over the above sequences repeat over very small ranges of x and also repeat over ranges of y of only 30, which is used in the Bernstein code but not (yet) implemented in my F# code.
I also do not include the well known optimizations that could be applied to the prime "squares free" culling to use wheel factorization and the calculations for the starting segment address as I use in my implementations of a segmented SoE.
So for purposes of calculating the sequence starting segment addresses for the "4x", "3x+", and "3x-" (or with "3x+" and "3x-" combined as I do in the F# code), and having calculated the ranges of x for each as per the above, the pseudo-code is as follows:
Calculate the range LOW - FIRST_ELEMENT, where FIRST_ELEMENT is with the lowest applicable value of y for each given value of x or y = x - 1 for the case of the "3x-" sequence.
For the job of calculating how many elements are in this range, this boils down to the question of how many of (y1)^2 + (y2)^2 + (y3)^2... there are where each y number is separated by two, to produce even or odd 'y's as required. As usual in square sequence analysis, we observe that differences between squares have a constant increasing increment as in delta(9 - 1) is 8, delta(25 - 9) is 16 for an increase of 8, delta(49 - 25) is 24 for a further increase of 8, et cetera, so that for n elements the last increment is 8 * n for this example. Expressing the sequence of elements using this, we get that it is one (or whatever one chooses as the first element) plus eight times the sequence of something like (1 + 2 + 3 + ... + n). Now standard reduction of linear sequences applies where this sum is (n + 1) * n / 2 or n^2/2 + n/2. This we can solve for how many n elements there are in the range by solving the quadratic equation n^2/2 + n/2 - range = 0 or n = (SQRT(8*range + 1) - 1) / 2.
Now, if FIRST_ELEMENT + 4 * (n + 1) * n does not equal LOW as the starting address, add one to n and use FIRST_ELEMENT + 4 * (n + 2) * (n + 1) as the starting address. If one uses further optimizations to apply wheel factorization culling to the sequence pattern, look up table arrays can be used to look up the closest value of used n that satisfies the conditions.
The modulus 12 or 60 of the starting element can be calculated directly or can be produced by use of look up tables based on the repeating nature of the modulo sequences.
Each sequence is then used to toggle the composite states up to the HIGH limit. If the additional logic is added to the sequences to jump values between only the applicable elements per sequence, no further use of modulo conditions is necessary.
The above is done for every "4x" sequence followed by the "3x+" and "3x-" sequences (or combine "3x+" and "3x-" into just one set of "3x" sequences) up to the x limits as calculated earlier or as tested per loop.
And there you have it: given an appropriate method of dividing the sieve range into segments, which is best used as fixed sizes that are related to the CPU cache sizes for best memory access efficiency, a method of segmenting the SoA just as used by Bernstein but somewhat simpler in expression as it mentions but does not combine the modulo operations and bit packing.
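As a rough illustration of sieving a single segment, here is a Python sketch of the basic modulo 12 form of the SoA restricted to a range [low, high). It is deliberately simplified: it computes the starting y for each x with an integer square root instead of the increment analysis described above, it skips the parity, wheel, and bit-packing optimizations, and its square-free culling tries every r rather than only primes. The names atkin_segment and ceil_sqrt are made up, and low is assumed to be at least 5.

```python
from math import isqrt


def ceil_sqrt(m):
    """Smallest non-negative integer r with r * r >= m."""
    return 0 if m <= 0 else isqrt(m - 1) + 1


def atkin_segment(low, high):
    """Primes p with low <= p < high (assumes low >= 5)."""
    flags = [False] * (high - low)

    # 4x^2 + y^2 = n, toggled when n mod 12 is 1 or 5
    x = 1
    while 4 * x * x + 1 < high:
        base = 4 * x * x
        y = max(1, ceil_sqrt(low - base))       # first y that lands in the segment
        while base + y * y < high:
            n = base + y * y
            if n % 12 in (1, 5):
                flags[n - low] ^= True
            y += 1
        x += 1

    # 3x^2 + y^2 = n, toggled when n mod 12 is 7
    x = 1
    while 3 * x * x + 1 < high:
        base = 3 * x * x
        y = max(1, ceil_sqrt(low - base))
        while base + y * y < high:
            n = base + y * y
            if n % 12 == 7:
                flags[n - low] ^= True
            y += 1
        x += 1

    # 3x^2 - y^2 = n with x > y >= 1, toggled when n mod 12 is 11
    x = 2
    while 2 * x * x + 2 * x - 1 < high:         # smallest n for this x is at y = x - 1
        base = 3 * x * x
        if base >= low:
            y_lo = isqrt(base - high) + 1 if base >= high else 1
            y_hi = min(x - 1, isqrt(base - low))
            for y in range(y_lo, y_hi + 1):
                n = base - y * y
                if n % 12 == 11:
                    flags[n - low] ^= True
        x += 1

    # Square-free culling: clear every multiple of r^2 inside the segment.
    for r in range(5, isqrt(high - 1) + 1):
        sq = r * r
        for n in range((low + sq - 1) // sq * sq, high, sq):
            flags[n - low] = False

    return [low + i for i, is_prime in enumerate(flags) if is_prime]


if __name__ == "__main__":
    print(atkin_segment(5, 60))
    print(atkin_segment(1000, 1060))
```

Consecutive segments can be processed by calling the function with adjacent ranges; a practical version would reuse precomputed base primes for the culling step.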

Autocorrelation Heuristics for a Tuner

I've implemented a simple autocorrelation routine against some audio samples at a rate of 44100.0 with a block size of 2048.
The general formula I am following looks like this:
r[k] = a[k] * b[k] = ∑ a[n] • b[n + k]
and I've implemented it in a brute-force nested loop as follows:
for k = 0 to N-1 do
    for n = 0 to N-1 do
        if (n+k) < N
            then r[k] := r[k] + a[n] * a[n+k]
        else
            break;
    end for n;
end for k;
I look for the max magnitude in r and determine how many samples away it is and calculate the frequency.
To help temper the tuner's results, I am using a circular buffer and returning the median each time.
The brute force calculations are a bit slow - is there a well-known, faster way to do them?
Sometimes, the tuner just isn't quite as accurate as is needed. What type of heuristics can I apply here to help refine the results?
Sometimes the OCTAVE is incorrect - is there a way to hone in on the correct octave a bit more accurately?
The efficient way to do autocorrelation is with an FFT:
1. FFT the time domain signal
2. convert the complex FFT output to squared magnitude with zero phase (i.e. the power spectrum)
3. take the inverse FFT
This works because autocorrelation in the time domain is equivalent to power spectrum in the frequency domain.
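A minimal pure-Python sketch of these three steps follows. It uses a small recursive radix-2 FFT so that it has no dependencies; in practice an optimized FFT library would be used instead. The names fft and autocorrelate are made up, and the zero-padding to at least twice the block length is an added assumption so that the circular result matches the non-wrapping sum in the question's loop.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def autocorrelate(samples):
    """r[k] = sum over n of samples[n] * samples[n + k], for k = 0 .. N-1."""
    n = len(samples)
    size = 1
    while size < 2 * n:                 # zero-pad so the circular result does not wrap
        size *= 2
    padded = list(samples) + [0.0] * (size - n)
    spectrum = fft(padded)                              # step 1
    power = [abs(c) ** 2 for c in spectrum]             # step 2: squared magnitude
    result = fft(power, inverse=True)                   # step 3
    return [result[k].real / size for k in range(n)]    # undo the FFT scaling


if __name__ == "__main__":
    print(autocorrelate([1.0, 2.0, 3.0, 4.0]))
```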
Having said that, bare bones autocorrelation is not a great way to implement (accurate) pitch detection in general, so you might want to have a rethink about your whole approach.
One simple way to improve this "brute force" autocorrelation method is to limit the range of k and only search for lags (or pitch periods) near the previous average period, say within ±0.5 semitones at first. If you don't find a correlation, then search a slightly wider range, say, within a major third, then search a wider range but within the expected frequency range of the instrument being tuned.
You can get higher frequency resolution by using a higher sample rate (say, upsampling the data before the autocorrelation if necessary, and with proper filtering).
You will get autocorrelation peaks for the pitch lag (period) and for multiples of that lag. You will have to eliminate those subharmonics somehow (maybe as impossible for the instrument, or perhaps as an unlikely pitch jump from the previous frequency estimations.)
I don't fully understand the question, but I can point out one trick that you might be able to use. You say you look for the sample that is the max magnitude. If it is useful in the rest of your calculations, you can calculate that sample number to sub-sample precision.
Let's say you have a peak of 0.9 at sample 5 and neighboring samples of 0.1 and 0.8. The actual peak is probably somewhere between sample 5 and sample 6.
(0.1 * 4 + 0.9 * 5 + 0.8 * 6) / (0.1 + 0.9 + 0.8) = 5.39
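The weighted average above can be written as a small Python helper. The name refine_peak is made up, and the sketch assumes the peak is not at either end of the array and that the three values are non-negative with a positive sum.

```python
def refine_peak(values, index):
    """Weighted centroid of the peak sample and its two neighbours."""
    positions = (index - 1, index, index + 1)
    weights = [values[i] for i in positions]
    return sum(w * i for w, i in zip(weights, positions)) / sum(weights)


if __name__ == "__main__":
    r = [0.0, 0.0, 0.0, 0.0, 0.1, 0.9, 0.8]
    print(round(refine_peak(r, 5), 2))   # 5.39
```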
