Given any two sequences/vectors of M real numbers, I can easily compute their closeness or correlation using a variety of metrics/norms. But is there an efficient structure to look up the closest M-sequence in a corpus of sequences, or the closest subsequence of a longer sequence? A sliding window would be the naive/brute-force approach. Does anyone know of anything better, though?
EDIT: As I'm typing this, I'm thinking that something like searching in a K-d tree might work, where each offset is a separate dimension in an M-dimensional space?
The problem with acceleration structures (such as K-d trees) is that they become less effective as the dimensionality (M, in the question) increases. If your M is very large, you might be better off with a linear search.
If your M is of moderate size (up to something like 6 or so, as a ballpark guess?), it may be worth trying a K-d tree. There are search structures available for higher-dimensional spaces; I recommend looking up Foundations of Multidimensional and Metric Data Structures, by Samet.
If a sliding window would work, you're probably doing a cross-correlation, in which case you can use FFTs to solve your problem faster by a factor of O(n/log(n)).
So if you have a vector V, and a corpus of C other vectors, and all vectors are size N, then the sliding window solution would take O(N^2 * C) time. By using FFTs you can reduce a single sliding window from O(N^2) to O(N log N), so the total time would be O(CN log N).
If you aren't familiar with FFTs then you will probably need to read up on them before using them, but the general idea is this:
# If you forget to take the complex conjugate of V you'll be doing a
# convolution instead of a correlation
V' := Fft(Conjugate(V))
for each vector W in C:
W' := Fft(W)
P := W' * V' # Multiplication here is the dot product
R := inverse_Fft(P)
# Check through the vector R for any spikes, a large value at
# R[i] indicates that if you shift W' by i then it will
# correlate strongly with W
Caveats:
1) If you're doing correlations at all you'll need to normalize your vectors, or at least do something to make sure you don't get false positives from vectors whose values are just larger and more positive than other vectors. If yours is a typical use case of looking for a signal in noise, though, then you're fine.
2) FFTs correlate under the assumption that all of these signals are circular. If you don't want to treat them like they're circular then you need to add a buffer of 0's to the end of each vector to double its length.
Related
I want to find ALL local maximums in a N*N matrix, with a constraint that every 2 peaks found must be at least M cells away (in both directions). In other words, for very peak P found, local maximums within (2M+1)*(2M+1) sub-matrix around P are ignored, if that peak is lower than P.
By local maximum I mean the largest element in the (2M+1)*(2M+1) submatrix centered at the element.
For the naive method, the complexity is O(N*N*M*M). Is there an efficient algorithm to achieve this?
This is a sample matrix for N=5 and M=1 (3*3 grid):
As your matrix appears to be something like an image, using image processing techniques appears to be the natural choice.
You could define peaks (local maxima or minima) as image regions with zero crossing of both local partial derivatives. If you want maxima look for negative curvature at these places, if you're looking for minima watch out for positive curvature (curvature -> second order derivative).
There are linear convolution operators available (and a whole lot of theory behind them), that produce the partial derivatives in x and y direction (e.g., Sobel, Prewitt) and second order derivatives.
There's even algorithms for blob detection already, which appears to be related to your task (e.g., Laplacian of Gaussian).
If you are looking for speed, you might want to see if you can benefit of linear separability, precomputation of filter kernels (associativity), or DFT. Also note that this kind of tasks usually benefit hugely of parallelization. See if you can leverage more than one core, a GPU or an FPGA for some performance boost.
I would use a floodfill approach (it's not actually floodfill, but floodfill was what I had in mind when I came up with it):
Find all minima. Put them in a sorted list/stack.
Pick (and remove) the first item from the list (lowest minimum).
If the element is marked as used, discard the item and go to 2.
Mark all elements inside the submatrix around the item as used.
Go to 2.
The algortihm ends when the list is empty.
Total cost: O(N*N + p * log p + p * M * M) where p is the number of minima.
I've been trying to find some places to help me better understand DFT and how to compute it but to no avail. So I need help understanding DFT and it's computation of complex numbers.
Basically, I'm just looking for examples on how to compute DFT with an explanation on how it was computed because in the end, I'm looking to create an algorithm to compute it.
I assume 1D DFT/IDFT ...
All DFT's use this formula:
X(k) is transformed sample value (complex domain)
x(n) is input data sample value (real or complex domain)
N is number of samples/values in your dataset
This whole thing is usually multiplied by normalization constant c. As you can see for single value you need N computations so for all samples it is O(N^2) which is slow.
Here mine Real<->Complex domain DFT/IDFT in C++ you can find also hints on how to compute 2D transform with 1D transforms and how to compute N-point DCT,IDCT by N-point DFT,IDFT there.
Fast algorithms
There are fast algorithms out there based on splitting this equation to odd and even parts of the sum separately (which gives 2x N/2 sums) which is also O(N) per single value, but the 2 halves are the same equations +/- some constant tweak. So one half can be computed from the first one directly. This leads to O(N/2) per single value. if you apply this recursively then you get O(log(N)) per single value. So the whole thing became O(N.log(N)) which is awesome but also adds this restrictions:
All DFFT's need the input dataset is of size equal to power of two !!!
So it can be recursively split. Zero padding to nearest bigger power of 2 is used for invalid dataset sizes (in audio tech sometimes even phase shift). Look here:
mine Complex->Complex domain DFT,DFFT in C++
some hints on constructing FFT like algorithms
Complex numbers
c = a + i*b
c is complex number
a is its real part (Re)
b is its imaginary part (Im)
i*i=-1 is imaginary unit
so the computation is like this
addition:
c0+c1=(a0+i.b0)+(a1+i.b1)=(a0+a1)+i.(b0+b1)
multiplication:
c0*c1=(a0+i.b0)*(a1+i.b1)
=a0.a1+i.a0.b1+i.b0.a1+i.i.b0.b1
=(a0.a1-b0.b1)+i.(a0.b1+b0.a1)
polar form
a = r.cos(θ)
b = r.sin(θ)
r = sqrt(a.a + b.b)
θ = atan2(b,a)
a+i.b = r|θ
sqrt
sqrt(r|θ) = (+/-)sqrt(r)|(θ/2)
sqrt(r.(cos(θ)+i.sin(θ))) = (+/-)sqrt(r).(cos(θ/2)+i.sin(θ/2))
real -> complex conversion:
complex = real+i.0
[notes]
do not forget that you need to convert data to different array (not in place)
normalization constant on FFT recursion is tricky (usually something like /=log2(N) depends also on the recursion stopping condition)
do not forget to stop the recursion if N=1 or 2 ...
beware FPU can overflow on big datasets (N is big)
here some insights to DFT/DFFT
here 2D FFT and wrapping example
usually Euler's formula is used to compute e^(i.x)=cos(x)+i.sin(x)
here How do I obtain the frequencies of each value in an FFT?
you find how to obtain the Niquist frequencies
[edit1] Also I strongly recommend to see this amazing video (I just found):
But what is the Fourier Transform A visual introduction
It describes the (D)FT in geometric representation. I would change some minor stuff in it but still its amazingly simple to understand.
I'm looking for an efficient algorithm that for a space with known height, width and length, given a fixed radius R, and a list of points N, with 3-dimensional coordinates in that space, will find all the points within a fixed radius R of an arbitrary point on the grid. This query will be done many times with different points, so an expensive pre-processing/sorting step, in exchange for quick queries may be worth it. This is a bit of a bottleneck step of an application I'm working on, so any time I can cut off of it is useful
Things I have tried so far:
-The naive algorithm, iterate over all points and calculate distance
-Divide the space into a grid with cubes of length R, and put the points into these. That way, for each point, I only have to ever query the immediate neighboring buckets. This has a significant speedup
-I've tried using the manhattan distance as a heuristic. That is, within the buckets, before calculating a distance to any point, use the manhattan distance to filter out those that can't possibly be within radius R (that is, those with a manhattan distance of <= sqrt(3)*R). I thought this would offer a speedup, as it only needs addition instead of multiplication, but it actually slowed the program down by a little bit
EDIT: To compare the distances, I use the squared distance to eliminate having to use a sqrt function.
Obviously, there will be some limit on how much I can speed this up, but I could use any suggestions on things to try now.
Not that it probably matters on the algorithmic level, but I'm working in C.
You may get a speed benefit from storing your points in a k-d tree with three dimensions. That will give you searchs in O(log n) amortized time.
Don't compare on the radius, compare on the square of the radius. The reason being is, if the distance between two points is less than R, then the square of the distance is less than R^2.
This way, when you're using the distance formula, you don't need to compute the square root, which is a very expensive operation.
I would recommend using either K-D tree or z-curve:
http://en.wikipedia.org/wiki/Z-order_%28curve%29
How about Binary Indexed Tree ? (Topcoder tutorials referred) It can be extended to n Dimensions,and is simpler to code.
Nicolas Brodu's NEIGHAND library do exactly what you want, improving on the bin-lattice algorithm.
More details can be found in his article: Query Sphere Indexing for Neighborhood Requests
[I might be misunderstanding the question. I'm finding the problem statement difficult to parse.]
In the old days, it was often good to design a this type of algorithm with "early outs" that do tests to try to avoid a more expensive calculation. In modern processors, a failure of a branch-prediction is often very expensive, and those early-out tests can actually be more expensive that the full calculation. (The only way to know for sure is to measure.)
In this case, the calculation is pretty simple, so it may be best to avoid building a data structure or doing any clever early-out checks and instead try to optimize, vectorize, and parallelize to get the throughput you need.
For a point P(x, y, z) and a sphere S(x_s, y_s, z_s, radius), the membership test is:
(x - x_s)^2 + (x - y_s)^2 + (z - z_s)^2 < radius^2
where radius^2 can be pre-calculated once for all the points in the query (avoiding any square root calculations). These calculations are all independent, you can compute it for several points in parallel. With something like SSE, you could probably do four at a time. And if you have many points to test, you could split the list and further parallelize the work across multiple cores.
I two sets of k-dimensional vectors, where k is around 500 and the number of vectors is usually smaller. I want to compute the (arbitrarily defined) minimal distance between the two sets.
A naive approach would be this:
(loop for a in set1
for b in set2
minimizing (distance a b))
However, this requires O(n² * distance) computations. Is there a faster way of doing this?
I don't think you can do better than O(n^2) when the distance is arbitrary (you have to examine each of the possible distances!). For a given distance function we might be able to exploit the properties of the function, but there won't be any general algorithm which works with any distance function in better than O(n^2) (i.e. o(n^2) : note smallOh).
If your data is dynamic and you have to keep obtaining the closest pair of points at different times, for arbitrary distance function the following papers by Eppstein will probably help (which have special update operations in order to make finding the closest pair of points quick):
http://www.ics.uci.edu/~eppstein/projects/pairs/Papers/Epp-SODA-98.pdf. [O(nlog^2(n)) update time]
http://academic.research.microsoft.com/Paper/1847461.aspx
You will be able to adapt the above one set algorithms to a two set algorithm (for instance, by defining distance between points of same set to be infinity).
For Euclidean type (L^p) distance, there are known O(nlogn) time algorithms, which work with a given set of points (i.e. you dont need to have any special update algorithms):
http://www.cse.iitd.ernet.in/~ssen/cs852/scribe/scribe2/lec.pdf
http://en.wikipedia.org/wiki/Closest_pair_of_points_problem
Of course, the L^p is for one set, but you might be able to adapt it for two sets.
If you give your distance function, it might be easier for us to help you.
Hope it helps. Good luck!
If the components of your vectors are scalars I would guess that for your case of a moderate k=500 the O(n²) approach is probably as fast as you can get. You can simplify your calculation by minimizing distance². Also, the distance(A_i, B_i) = distance(B_i, A_i), so make sure you only compare them once (you only have 500!/(500-2)! pairs, not 500²).
If the components are m-dimensional vectors A and B instead, you could store the components of vector A in a R-tree or a kd-tree and then find the closest pair by iterating over all components of vector B and finding its closest partner from A--- this would be O(n). Don't forget that big-O is for n->infinity, so the trees might come with some pretty expensive constant term (i.e. this approach might only make sense for large k or if vector A is always the same).
Put the two sets of coordinates into a Spatial Index, e.g. a KD-tree.
You then compute the intersection of these two indices.
First of all, the title is very bad, due to my lack of a concise vocabulary. I'll try to describe what I'm doing and then ask my question again.
Background Info
Let's say I have 2 matrices of size n x m, where n is the number of experimental observation vectors, each of length m (the time series over which the observations were collected). One of these matrices is the original matrix, called S, the other which is a reconstructed version of S, called Y.
Let's assume that Y properly reconstructs S. However due to the limitations of the reconstruction algorithm, Y can't determine the true amplitude of the vectors in S, nor is it guaranteed to provide the proper sign for those vectors (the vectors might be flipped). Also, the order of the observation vectors in Y might not match the original ordering of the corresponding vectors in S.
My Question
Is there an algorithm or technique to generate a new matrix which is a 'realignment' of Y to S, so that when Y and S are normalized, the algorithm can (1) find the vectors in Y that match the vectors in S and restore the original ordering of the vectors and (2) likewise match the signs of the vectors?
As always, I really appreciate all help given. Thanks!
How about simply calculating the normalized form for each vector in both matrices and comparing? That should give you an exacty one-to-one match for each vector in each matrix.
The normal form of a vector is one that conforms to:
v_norm = v / ||v||
where ||v|| is the euclidean norm for the vector. For v=(v1, v2, ..., vn) we have ||v|| = sqrt(v1^2 + ... + vn^2).
From there you can reconstruct their order, and return each vector its original length and direction (the vector or its opposite).
The algorithm should be fairly simple from here on, just decide on your implementation. This method should be of quadratic complexity. Per the comment, you can indeed achieve O(nlogn) complexity on this algorithm. If you need something better than that, linear complexity - specifically, you're going to need a much more complicated algorithm which I can't think of right now.