Computing derivatives of discrete periodic data - methods

I have an array y[x], x=0,1,2,...,10^6 describing a periodic signal with y(10^6)=y(0), and I want to compute its derivative dy/dx with a fast method.
I tried the spectral difference method, namely
dy/dx = inverse_fourier_transform( i*k * fourier_transform(y)[k] )    (1)
and the result is different from (y[x+1]-y[x-1])/2, i.e. the value suggested by the finite difference method.
Which of the two is more accurate, and which is faster? Are there other comparable methods?
Below is an effort to understand the difference of the results:
If one expands both the sum for the fourier_transform and that for the inverse_fourier_transform in (1), one can express dy/dx as a linear combination of y[x] with coefficients a[x]. I computed these coefficients and they seem to go as 1/n (when the length of the array goes to infinity) with n being the distance to where the derivative is examined. Compared to the finite differencing method which uses only the two neighboring points, the spectral difference is highly non-local... Am I correct with this result, and if yes, how to understand this?

If you are sampling the signal above the Nyquist rate (more than twice its highest frequency), then the Fourier method gives you an exact answer, because your data completely describe the signal (assuming no noise).
The finite difference method is an approximation and so is not exact: the central difference you quote is second-order accurate in the sample spacing, and its error grows with the frequency content of the signal. Still, if you plot the two, they should show the same basic trends. If they look completely different, then you probably have an error somewhere.
However, a fast FT is O(n log(n)) while finite differences are O(n), so the latter is faster (but not so much faster that it should be automatically preferred).
The Fourier approach is non-local in the sense that it constructs the whole signal, exactly (and so uses all wavelengths).
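As a rough illustration of the comparison above, here is a minimal Python sketch that differentiates one period of a sine both ways. It uses a naive O(n²) DFT from the standard library only so that it stays self-contained; a real run on 10^6 points would use an FFT. The function names, the unit sample spacing, and the test signal are assumptions for illustration, not part of the question. One detail worth checking in any implementation of (1): the factor k has to be the signed wavenumber (negative for the upper half of the spectrum), scaled by 2*pi/n.

```python
import cmath
import math


def dft(values, sign):
    """Naive O(n^2) discrete Fourier transform (sign=-1 forward, +1 inverse, unnormalized)."""
    n = len(values)
    return [
        sum(v * cmath.exp(sign * 2j * math.pi * k * x / n) for x, v in enumerate(values))
        for k in range(n)
    ]


def spectral_derivative(y):
    """dy/dx for periodic samples with unit spacing, via multiplication by i*k."""
    n = len(y)
    spectrum = dft(y, -1)
    for k in range(n):
        # Map the index to a signed wavenumber; zero the Nyquist mode for even n.
        signed_k = k if 2 * k < n else k - n
        if 2 * k == n:
            signed_k = 0
        spectrum[k] *= 2j * math.pi * signed_k / n
    return [value.real / n for value in dft(spectrum, +1)]


def central_difference(y):
    """(y[x+1] - y[x-1]) / 2 with periodic wrap-around."""
    n = len(y)
    return [(y[(x + 1) % n] - y[x - 1]) / 2 for x in range(n)]


n, mode = 32, 3
w = 2 * math.pi * mode / n
y = [math.sin(w * x) for x in range(n)]
exact = [w * math.cos(w * x) for x in range(n)]

spectral_error = max(abs(a - b) for a, b in zip(spectral_derivative(y), exact))
fd_error = max(abs(a - b) for a, b in zip(central_difference(y), exact))
print(spectral_error, fd_error)
```

For this band-limited signal the spectral result should agree with the exact derivative to rounding error, while the central difference returns sin(w)*cos(w*x) instead of w*cos(w*x), so its error is visible and grows with the mode number.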


Could Halley's method iteration be better than Newton's in some cases?

I am investigating which division algorithm will be most efficient on high-precision number inputs. One of the factors (at least with iterative methods) is choosing the best possible starting point. However, even more important could be the choice of method itself.
When trying to compute the reciprocal of a number, a, one of the methods used is the Newton iteration
This has quadratic convergence. However, a closely related Halley iteration
has cubic convergence, but a greater computational cost. Is it possible that in high precision applications, Halley's method would be superior to Newton's, despite its greater computational cost? Or maybe a combination of the two methods would be even better - starting with Newton to get some amount of precision fast and continuing with Halley...
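The two update formulas did not survive in the text above, so the following Python sketch assumes the usual division-free forms for the reciprocal of a: Newton's update x <- x*(2 - a*x), and a cubically convergent update x <- x*(1 + e + e^2) with e = 1 - a*x. The second form is an assumption about what the question calls the Halley iteration, and the function names are hypothetical. Exact rational arithmetic is used only to make the error behaviour visible.

```python
from fractions import Fraction


def newton_reciprocal(a, x0, steps):
    """Quadratic iteration: the relative error e = 1 - a*x is squared each step."""
    x = x0
    for _ in range(steps):
        x = x * (2 - a * x)          # 2 multiplications per step
    return x


def cubic_reciprocal(a, x0, steps):
    """Cubic iteration (assumed form): the relative error e is cubed each step."""
    x = x0
    for _ in range(steps):
        e = 1 - a * x
        x = x * (1 + e + e * e)      # 3 multiplications per step
    return x


a = Fraction(3)
x0 = Fraction(3, 10)                 # initial relative error 1 - a*x0 = 1/10

newton_error = 1 - a * newton_reciprocal(a, x0, 3)
cubic_error = 1 - a * cubic_reciprocal(a, x0, 2)
print(newton_error, cubic_error)
```

In this sketch three Newton steps and two cubic steps both cost six multiplications, and they reduce the error from 1/10 to 10^-8 and 10^-9 respectively, so by this naive count the cubic form is only marginally ahead. In a real multiprecision setting the cost of each multiplication depends on the working precision of that step, which this count ignores.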

Linear time when running merge sort and quick sort

As far as I learned at my university, it is proved that the lower bound for a comparison-based algorithm that sorts random data is Ω(n log n). I also know that the average case of Heapsort and Quicksort is O(n log n).
Therefore, I tried to plot the times these algorithms needed to sort a set of random data.
I used the algorithms that are posted at Rosetta Code: quicksort and heapsort. When I tried to plot the time each one needed to sort random data, for up to 1 million numbers, I got the following charts that appear to be linear:
You can also find the results I got from running heapsort from here.
In addition you can also find the results I got from running quicksort from here
However, I do get O(n^2) time complexity when running bubblesort as shown at the plot below:
Why is this? What I might be missing here?
The difference is simply too small to see with the naked eye at this scale:
Using your HeapSort results (600ms for 1000000 entries), here is an O(n) function (green) and an O(n log n) function (red):
The two functions in this picture are:
y = 600/1000000 * x green
y = 1/(10000 log(10)) * x*log(x) red
(Note that these functions have vastly different constant scaling factors, but of course these do not matter in Big-O notation.)
However, just because they are hard to tell apart in a diagram does not mean they are impossible to distinguish.
As mentioned in the comments, your main options are bigger datasets, or slower comparison functions. Most sorting algorithms will allow you to specify a comparison function, which should not, under normal circumstances, change the O() time complexity. (beware of non-transitive comparison functions though)
If that is not possible, and you just want to treat the algorithms as black boxes, you could simply repeat the experiment and average the results until noise is low enough to distinguish between these two curves.
To get the appropriate "ideal" n log n curve for comparison with your averaged data, you need to solve the equation y = a*x * log(x); y=MAXIMUM_TIME; x=MAXIMUM_INPUT_LENGTH;, for example with Wolfram Alpha
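The scaling constant can also be obtained in a couple of lines of Python. This is a small sketch using the 600 ms / 1000000 entries figures quoted above; the function names are made up for illustration.

```python
import math

MAX_N = 1_000_000       # MAXIMUM_INPUT_LENGTH from the measurements above
MAX_TIME_MS = 600.0     # MAXIMUM_TIME from the measurements above


def linear_model(n):
    """O(n) curve forced through (MAX_N, MAX_TIME_MS)."""
    return MAX_TIME_MS / MAX_N * n


def nlogn_model(n):
    """O(n log n) curve forced through the same point: y = a * n * log(n)."""
    a = MAX_TIME_MS / (MAX_N * math.log(MAX_N))
    return a * n * math.log(n)


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, linear_model(n), nlogn_model(n))
```

Both curves meet at the largest input, and the gap between them elsewhere is only a fraction of a millisecond to a few tens of milliseconds, which is why timing noise can hide it.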
One important point here is that even though these curves look similar, this does not mean that the run-time of a hypothetical linear sorting algorithm would not be worthwhile for less than a million entries. If you managed to come up with a linear sorting algorithm with the same constant factor as the n log n algorithm, the curves would look like this:

Is there an algorithm better than O(N²) to determine if matrix is symmetric?

Algorithm requirements
Input is an arbitrary square matrix M of size N×N, which just fits in memory.
The algorithm's output must be true if M[i,j] = M[j,i] for all j≠i, false otherwise.
Obvious solutions
Check if the transpose equals the matrix itself (M^T = M). Easiest to program in many environments, but (usually) consumes twice the memory and requires N² comparisons worst case. Therefore, this is O(N²) and has high peak memory.
Check if the lower triangular part equals the upper triangular part. Of course, the algorithm returns on the first inequality found. This would make the worst case (worst case being, the matrix is indeed symmetric) require (N² - N)/2 comparisons, since the diagonal does not need to be checked. So although it is better than option 1, this is still O(N²).
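For reference, the second option can be sketched in a few lines of Python. The matrix is assumed to be a list of equal-length rows, and the function name is hypothetical.

```python
def is_symmetric(matrix):
    """Compare each strictly-lower-triangular entry with its mirror image.

    Returns False on the first mismatch; a symmetric N x N input costs
    N*(N-1)/2 comparisons and no extra memory.
    """
    n = len(matrix)
    for i in range(1, n):
        row = matrix[i]
        for j in range(i):          # j < i, so the diagonal is skipped
            if row[j] != matrix[j][i]:
                return False
    return True


print(is_symmetric([[1, 2, 3], [2, 5, 6], [3, 6, 9]]))
print(is_symmetric([[1, 2, 3], [2, 5, 6], [3, 7, 9]]))
```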
Although it's hard to see how it would be possible (the N² elements will all have to be compared somehow), is there an algorithm doing this check that is better than O(N²)?
Or, provided there is a proof of non-existence of such an algorithm: how to implement this most efficiently for a multi-core CPU (Intel or AMD) taking into account things like cache-friendliness, optimal branch prediction, other compiler-specific specializations, etc.?
This question stems mostly from academic interest, although I imagine a practical use could be to determine what solver to use if the matrix describes a linear system AX=b...
Since you will have to examine all the elements except the diagonal, the complexity IMO can't be better than O (n^2).
For a dense matrix, the answer is a definite "no", because any uninspected (non-diagonal) elements could be different from their transposed counterparts.
For standard representations of a sparse matrix, the same reasoning indicates that you can't generally do better than the input size.
However, the same reasoning doesn't apply to arbitrary matrix representations. For example, you could store sparse representations of the symmetric and antisymmetric components of your matrix, which can easily be checked for symmetry in O(1) time by checking whether the antisymmetric component has any nonzero entries at all...
I think you can take a probabilistic approach here.
If x randomly picked lower-triangular elements all match their upper-triangular counterparts, I think that is unlikely to be a coincidence: the chance is very high that the matrix is indeed symmetric.
So instead of going through all the (n² - n)/2 off-diagonal pairs you can check p random coordinates and tell if the matrix is symmetric with confidence:
p / ((n² - n)/2)
you can then decide a threshold above which you believe that the matrix must be a symmetric matrix.
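A minimal Python sketch of this sampling idea follows; the function name, the seed parameter, and sampling with replacement are assumptions made for illustration. Note that a mismatch proves the matrix is not symmetric, whereas passing all p checks is only evidence of symmetry: a matrix with a single asymmetric pair can still slip through.

```python
import random


def probably_symmetric(matrix, p, seed=None):
    """Check p randomly chosen off-diagonal pairs (sampled with replacement).

    False means a mismatch was found, so the matrix is definitely not symmetric.
    True only means that no mismatch was seen among the sampled pairs.
    """
    n = len(matrix)
    if n < 2:
        return True
    rng = random.Random(seed)
    for _ in range(p):
        i, j = rng.sample(range(n), 2)      # two distinct indices, so i != j
        if matrix[i][j] != matrix[j][i]:
            return False
    return True


n = 50
symmetric = [[i + j for j in range(n)] for i in range(n)]
antisymmetric = [[i - j for j in range(n)] for i in range(n)]
print(probably_symmetric(symmetric, 100, seed=1))
print(probably_symmetric(antisymmetric, 100, seed=1))
```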

Fast algorithm to calculate Pi in parallel

I am starting to learn CUDA and I think calculating many digits of pi would be a nice, introductory project.
I have already implemented the simple Monte Carlo method, which is easily parallelizable. I simply have each thread randomly generate points on the unit square, figure out how many lie within the unit circle, and tally up the results using a reduction operation.
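The structure described here (independent workers followed by a reduction) can be sketched on the CPU in Python as below. This is not the CUDA code from the question; the worker count, seeds, and function names are placeholders chosen for illustration.

```python
import random


def count_hits(samples, seed):
    """One worker: count random points of the unit square inside the quarter circle."""
    rng = random.Random(seed)       # each worker gets its own generator
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(workers, samples_per_worker):
    """Reduction step: sum the per-worker tallies and scale by 4."""
    total_hits = sum(count_hits(samples_per_worker, seed) for seed in range(workers))
    return 4.0 * total_hits / (workers * samples_per_worker)


print(estimate_pi(workers=8, samples_per_worker=20_000))
```

The statistical error shrinks only like 1/sqrt(number of samples), so each extra decimal digit costs roughly 100 times more samples.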
But that is certainly not the fastest algorithm for calculating the constant. Before, when I did this exercise on a single threaded CPU, I used Machin-like formulae to do the calculation for far faster convergence. For those interested, this involves expressing pi as the sum of arctangents and using Taylor series to evaluate the expression.
An example of such a formula:
Unfortunately, I found that parallelizing this technique to thousands of GPU threads is not easy. The problem is that the majority of the operations are simply doing high precision math as opposed to doing floating point operations on long vectors of data.
So I'm wondering, what is the most efficient way to calculate arbitrarily long digits of pi on a GPU?
You should use the Bailey–Borwein–Plouffe formula
Why? First of all, you need an algorithm that can be broken down. So, the first thing that came to my mind is having a representation of pi as an infinite sum. Then, each processor just computes one term, and you sum them all in the end.
Then, it is preferable that each processor manipulates small-precision values, as opposed to very high precision ones. For example, if you want one billion decimals, and you use some of the expressions used here, like the Chudnovsky algorithm, each of your processors will need to manipulate a number a billion digits long. That's simply not the appropriate method for a GPU.
So, all in all, the BBP formula will allow you to compute the digits of pi (in base 16) separately (the algorithm is very cool), and with "low precision" processors! Read the "BBP digit-extraction algorithm for π"
Advantages of the BBP algorithm for computing π
This algorithm computes π without requiring custom data types having thousands or even millions of digits. The method calculates the nth digit without calculating the first n − 1 digits, and can use small, efficient data types.
The algorithm is the fastest way to compute the nth digit (or a few digits in a neighborhood of the nth), but π-computing algorithms using large data types remain faster when the goal is to compute all the digits from 1 to n.
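To make the digit-extraction idea concrete, here is a short Python sketch of the standard BBP scheme using ordinary double-precision floats. It returns hexadecimal digits, since that is the base in which the formula isolates digits. The function names are hypothetical, and with plain floats the result is only trustworthy for modest positions; a GPU version would run one such computation per thread.

```python
def _bbp_series(j, n):
    """Fractional part of sum over k of 16^(n-k) / (8k + j)."""
    total = 0.0
    # Terms with k <= n: modular exponentiation keeps every value small.
    for k in range(n + 1):
        denom = 8 * k + j
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n: they shrink by a factor of 16 each, so a few suffice.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digit(n):
    """Hex digit of pi at position n after the point (n = 0 is the first one)."""
    x = (4 * _bbp_series(1, n) - 2 * _bbp_series(4, n)
         - _bbp_series(5, n) - _bbp_series(6, n)) % 1.0
    return int(x * 16)


# pi = 3.243F6A88... in hexadecimal
print("".join("%X" % pi_hex_digit(n) for n in range(8)))
```

Each call is independent of the others and needs only a handful of machine words of state, which is the property that makes the approach attractive for many small processors.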

What algorithms have high time complexity, to help "burn" more CPU cycles?

I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast in floating point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board where the DRAM to chip rate is somewhat limited, and the board to PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication, I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O, which takes (lots of) seconds, kills the demo.
So, I am looking for some kind of a calculation with higher complexity than n^3, the more the better (but preferably easy to program and to explain/understand) to make the computation part more dominant in the time budget, where the dataset is preferably bound to about 16KB per thread (core).
Any suggestion?
PS: I think it is very similar to this question in its essence.
You could generate large (256-bit) numbers and factor them; that's commonly used in "stress-test" tools. If you specifically want to exercise floating point computation, you can build a basic n-body simulator with a Runge-Kutta integrator and run that.
What you can do is
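Declare a std::vector of int
Populate it with 0 to N-1 in ascending order (starting from descending order would make the very first std::next_permutation call return false)
Now keep calling std::next_permutation repeatedly until the elements are sorted again, i.e. until next_permutation returns false.
With N integers this will need O(N!) calculations, and it is also deterministic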
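The same loop can be sketched in Python, which has no built-in equivalent of std::next_permutation, so the sketch below includes a hand-written version intended to mirror its behaviour (advance to the next lexicographic permutation in place, and return False when wrapping around to sorted order). The function names are illustrative.

```python
def next_permutation(seq):
    """Advance seq in place to the next lexicographic permutation.

    Returns False (and leaves seq sorted ascending) if seq was the last one.
    """
    i = len(seq) - 2
    while i >= 0 and seq[i] >= seq[i + 1]:
        i -= 1
    if i < 0:
        seq.reverse()
        return False
    j = len(seq) - 1
    while seq[j] <= seq[i]:
        j -= 1
    seq[i], seq[j] = seq[j], seq[i]
    seq[i + 1:] = reversed(seq[i + 1:])
    return True


def burn_permutations(n):
    """Visit all n! orderings of 0..n-1; returns how many were visited."""
    values = list(range(n))     # ascending start, so the loop covers every permutation
    visited = 1
    while next_permutation(values):
        visited += 1
    return visited, values


print(burn_permutations(4))
```

The working set is just N integers, so the data stays tiny while the run time grows factorially with N.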
PageRank may be a good fit. Articulated as a linear algebra problem, one repeatedly squares a certain floating-point matrix of controllable size until convergence. In the graphical metaphor, one "ripples" change coming into each node onto the other edges. Both treatments can be made parallel.
You could do a least trimmed squares fit. One use of this is to identify outliers in a data set. For example you could generate samples from some smooth function (a polynomial say) and add (large) noise to some of the samples, and then the problem is to find a subset H of the samples of a given size that minimises the sum of the squares of the residuals (for the polynomial fitted to the samples in H). Since there are a large number of such subsets, you have a lot of fits to do! There are approximate algorithms for this, for example here.
Well, one way to go would be to implement a brute-force solver for the Traveling Salesman problem in some M-dimensional space (with M > 1).
The brute-force solution is to just try every possible permutation and then calculate the total distance for each permutation, without any optimizations (including no dynamic programming tricks like memoization).
For N points, there are (N!) permutations (with a redundancy factor of at least (N-1), but remember, no optimizations). Each pair of points requires (M) subtractions, (M) multiplications and one square root operation to determine their pythagorean distance apart. Each permutation has (N-1) pairs of points to calculate and add to the total distance.
So order of computation is O(M((N+1)!)), whereas storage space is only O(N).
Also, this should be neither too hard nor too intensive to parallelize across the cores, though it does take some overhead. (I can demonstrate, if needed).
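A deliberately unoptimized version of this solver could look like the following Python sketch. It follows the description above (every permutation, N-1 point pairs per permutation, no memoization), so it finds the shortest open path rather than a closed tour; the function names and the sample points are made up for illustration.

```python
import itertools
import math


def path_length(points, order):
    """Sum of Euclidean distances between consecutive points in the given order."""
    total = 0.0
    for a, b in zip(order, order[1:]):
        # M subtractions, M multiplications and one square root per pair
        total += math.sqrt(sum((p - q) ** 2 for p, q in zip(points[a], points[b])))
    return total


def brute_force_shortest_path(points):
    """Try all N! visiting orders and keep the shortest one."""
    best_length, best_order = math.inf, None
    for order in itertools.permutations(range(len(points))):
        length = path_length(points, order)
        if length < best_length:
            best_length, best_order = length, order
    return best_length, best_order


square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
print(brute_force_shortest_path(square))
```

One simple way to split the work across cores would be to give each core the permutations that begin with a particular first point and then take the minimum of the per-core results.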
Another idea might be to compute a fractal map. Basically, choose a grid of whatever dimensionality you want. Then, for each grid point, do the fractal iteration to get the value. Some points might require only a few iterations; I believe some will iterate forever (chaos; of course, this can't really happen when you have a finite number of floating-point numbers, but still). The ones that don't stop you'll have to "cut off" after a certain number of iterations... just make this preposterously high, and you should be able to demonstrate a high-quality fractal map.
Another benefit of this is that grid cells are processed completely independently, so you will never need to do communication (not even at boundaries, as in stencil computations, and definitely not O(pairwise) as in direct N-body simulations). You can usefully use O(gridcells) number of processors to parallelize this, although in practice you can probably get better utilization by using gridcells/factor processors and dynamically scheduling grid points to processors on an as-ready basis. The computation is basically all floating-point math.
Mandelbrot/Julia and Lyapunov come to mind as potential candidates, but any should do.
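As an illustration of the per-grid-point work, here is a small Python sketch of the Mandelbrot escape-time computation with an iteration cut-off. The grid bounds, resolution, and function names are arbitrary choices for the example; raising max_iter is the knob that increases the compute time without increasing the data size.

```python
def escape_time(c, max_iter):
    """Iterate z <- z*z + c from z = 0; return the step at which |z| exceeds 2.

    Points that never escape are cut off and reported as max_iter.
    """
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter


def mandelbrot_grid(width, height, max_iter):
    """Escape times on a width x height grid over [-2, 1] x [-1.5, 1.5].

    Every cell is independent, so rows or cells can be handed to separate cores.
    """
    rows = []
    for r in range(height):
        im = -1.5 + 3.0 * r / (height - 1)
        rows.append([
            escape_time(complex(-2.0 + 3.0 * col / (width - 1), im), max_iter)
            for col in range(width)
        ])
    return rows


grid = mandelbrot_grid(width=40, height=20, max_iter=200)
print(sum(cell == 200 for row in grid for cell in row), "cells hit the cut-off")
```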
