Large Residual-Online Outlier Detection for Kalman Filter

Large Residual-Online Outlier Detection for Kalman Filter - filter

I am trying to find outliers in Residual. I used three algorithms basically if the residuals magnitudes are less, the algorithm performances are good but if the residuals magnitude are big, the algorithm performances are not good.
1) 𝑿^𝟐=〖(𝒚−𝒉(𝒙))〗^𝑻 𝑺^(−𝟏) (𝒚−𝒉(𝒙)) - Chi-Square Test
if the matrix 3x3 - degree of freedom is 4.
𝑿^𝟐 > 13.277
2) Residual(i) > 3√(HP 𝐻^𝑇 + R) - Measurement Covariance Noise
3) Residual(i) > 3-Sigma
I have applied three algorithms to find the outliers. First one is Chi Square Test, second checks Measurement Covariance Noise, Third looks the 3 sigma.
Can you give any suggestion about the algorithms or I can implement a new way if you suggest?

The third case cannot be correct for all case because if there is a large residual, will fail. The second one is more stable because it is related to measurement noise covariance so that your residual should change according to the measurement covariance error.

Related

Why is there logarithm (and the square root) in the UCB formula of Monte Carlo Tree Search?

I studied Monte Carlo Tree Search (UCT) from several sources, like this: http://www.incompleteideas.net/609%20dropbox/other%20readings%20and%20resources/MCTS-survey.pdf
However, I didn't understand why there is logarithm (and the square root) in the UCB formula of Monte Carlo Tree Search (section 2.4.2 and 3.3.1).
The formula is the following:

The left hand side of the equation is a maximum likelihood estimation, that is it is just a measurement of the viewed win rate for this node and the right hand side is an estimate of uncertainty. The more uncertain we are, the higher we value the node and this promotes exploration.
Ultimately the use of whatever functions comes down to designing the shape of the curve they want to exhibit in their algorithm and how that shape was decided on you'll have to read the literature. If you would like to visualize the shape of the curve you can simply type graph square root of (ln x / y) into google search and it will give you an interactive graph you can inspect.
Logarithms are often used in uncertainty measurements because it's a way of saying each incremental value added has less impact than the one before so as we have more information it changes our estimate less and less which makes sense because the more information we have, the more we trust our estimate is correct.
Square roots do the same thing except to a lesser degree. There is a difference in the shape of the curves between the values of 0 and 1 however. For logarithms, values below 0 are negative, but since it's a logarithm of a count that's never the case. For square roots values increase quickly between 0 and 1 and then slow down their overall increase greatly. Because ln(sp) / si will often have a value between 0 and 1 (anytime si > ln(sp) using a logarithm does not make sense because it would subtract from the value of the estimate and reduce the likelihood we explore that branch!

Algorithm to approximate non-linear equation system solution

I'm looking for an algorithm to approximate the solution of the following equation system:
The equations have to be solved on an embedded system, in C++.
Background:
We measure the 2 variables X_m and Y_m, so they are known
We want to compute the real values: X_r and Y_r
X and Y are real numbers
We measure the functions f_xy and f_yx during calibration. We have maximal 18 points of each function.
It's possible to store the functions as a look-up table
I tried to approximate the functions with 2nd order polynomials and compute the solution, but it was not accurate enough, because of the fitting error.
I am looking for an algorithm to approximate the results in an embedded system in C++, but I don't even know what to search for. I found some papers on the theory link, but I think there must be an easier way to do it in my case.
Also: how can I determine during calibration, whether the functions can be solved with the algorithm?

Fitting a second-order polynomial through f_xy? That's generally not viable. The go-to solution would be Runga-Kutta interpolation. You pick two known values left and two to the right of your argument, with weights 1,2,2,1. This gets you an estimate d(f_xy)/dx which you can then use for interpolation.

The normal way is by Newton's iterations, starting from the initial approximation (Xm, Ym) [assuming that the f are mere corrections]. Due to the particular shape of the equations, you can reduce to twice a single equation in a single unknown.
Xr = Xm - Fyx(Ym - Fxy(Xr))
Yr = Ym - Fxy(Xm - Fyx(Yr))
The iterations read
Xr <-- Xr - (Xm - Fyx(Ym - Fxy(Xr))) / (1 + Fxy'(Ym - Fxy(Xr)).Fxy'(Xr))
Yr <-- Yr - (Ym - Fxy(Xm - Fyx(Yr))) / (1 + Fyx'(Xm - Fyx(Yr)).Fyx'(Yr))
So you should tabulate the derivatives of f as well, though accuracy is not so critical than for the computation of the f themselves.
If the calibration points aren't too noisy, I would recommend cubic spline interpolation, for which you can precompute all coefficients. At the same time these coefficients allow you to estimate the derivative (as the corresponding quadratic interpolant, which is continuous).
In principle (unless the points are uniformly spaced), you need to perform a dichotomic search to determine the interval in which the argument lies. But here you will evaluate the functions at nearby values, so that a linear search from the previous location should be better.
A different way to address the problem is by considering the bivariate solution surfaces Xr = G(Xm, Ym) and Yr = G(Xm, Ym) that you compute on a grid of points. If the surfaces are smooth enough, you can use a coarse grid.
So by any method (such as the one above), you precompute the solutions at each grid node, as well as the coefficients of some interpolant in the X and Y directions. I recommend a cubic spline, again.
Now to interpolate inside a grid cell, you combine the two univarite interpolants to a bivariate one by means of the Coons formula https://en.wikipedia.org/wiki/Coons_patch.

Find correlation in large dataset

I have a huge dataset. We are talking about 100 3D matrices with 121x145x121 cells. Any cell has a value between 0 and 1, and I need a way to cluster these cells according to their correlation. The problem is the dataset is too big for any algorithm I know; even using just half of it (any matrix is a MRI scan of a brain) we have around 400 billion pairs. Any ideas?

As a first step I would be tempted to try K-means clustering.
This appears in the Matlab statistics toolbox as the function kmeans.
In this algorithm you only end up computing the distances between the K current centres and the data, so the number of pairs is much smaller than comparing all choices.
In Matlab, I've also found that the speed of the operation can be quite dependent on the organisation of your matrix (due to memory caching and optimisation issues). I would recommend transforming your 3d matrices so that the columns (held together in memory) correspond to the 100 values for a particular cell.
This can be done with the permute function.

Try a weighted K-means++ clustering algorithm. Create one matrix of the sum of values for all the 100 input matrices at every point to produce one "grey scale" matrix, then adjust the K-means++ algorithm to work with weighted, (wt), values.
In the initialization phase choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(X)^2 x wt^2 .
The assignment step should be okay, but when computing the centroids in the update step adjust the formula to account for the weights. (Or use the same formula but each point is used wt times).
You may not be able to use a library function to do this but you start with a 100 fold decrease in number of points and matrices to work with.

Why does FFT produce complex numbers instead of real numbers?

All the FFT implementations we have come across result in complex values (with real and imaginary parts), even if the input to the algorithm was a discrete set of real numbers (integers).
Is it not possible to represent frequency domain in terms of real numbers only?

The FFT is fundamentally a change of basis. The basis into which the FFT changes your original signal is a set of sine waves instead. In order for that basis to describe all the possible inputs it needs to be able to represent phase as well as amplitude; the phase is represented using complex numbers.
For example, suppose you FFT a signal containing only a single sine wave. Depending on phase you might well get an entirely real FFT result. But if you shift the phase of your input a few degrees, how else can the FFT output represent that input?
edit: This is a somewhat loose explanation, but I'm just trying to motivate the intuition.

The FFT provides you with amplitude and phase. The amplitude is encoded as the magnitude of the complex number (sqrt(x^2+y^2)) while the phase is encoded as the angle (atan2(y,x)). To have a strictly real result from the FFT, the incoming signal must have even symmetry (i.e. x[n]=conj(x[N-n])).
If all you care about is intensity, the magnitude of the complex number is sufficient for analysis.

Yes, it is possible to represent the FFT frequency domain results of strictly real input using only real numbers.
Those complex numbers in the FFT result are simply just 2 real numbers, which are both required to give you the 2D coordinates of a result vector that has both a length and a direction angle (or magnitude and a phase). And every frequency component in the FFT result can have a unique amplitude and a unique phase (relative to some point in the FFT aperture).
One real number alone can't represent both magnitude and phase. If you throw away the phase information, that could easily massively distort the signal if you try to recreate it using an iFFT (and the signal isn't symmetric). So a complete FFT result requires 2 real numbers per FFT bin. These 2 real numbers are bundled together in some FFTs in a complex data type by common convention, but the FFT result could easily (and some FFTs do) just produce 2 real vectors (one for cosine coordinates and one for sine coordinates).
There are also FFT routines that produce magnitude and phase directly, but they run more slowly than FFTs that produces a complex (or two real) vector result. There also exist FFT routines that compute only the magnitude and just throw away the phase information, but they usually run no faster than letting you do that yourself after a more general FFT. Maybe they save a coder a few lines of code at the cost of not being invertible. But a lot of libraries don't bother to include these slower and less general forms of FFT, and just let the coder convert or ignore what they need or don't need.
Plus, many consider the math involved to be a lot more elegant using complex arithmetic (where, for strictly real input, the cosine correlation or even component of an FFT result is put in the real component, and the sine correlation or odd component of the FFT result is put in the imaginary component of a complex number.)
(Added:) And, as yet another option, you can consider the two components of each FFT result bin, instead of as real and imaginary components, as even and odd components, both real.

If your FFT coefficient for a given frequency f is x + i y, you can look at x as the coefficient of a cosine at that frequency, while the y is the coefficient of the sine. If you add these two waves for a particular frequency, you will get a phase-shifted wave at that frequency; the magnitude of this wave is sqrt(x*x + y*y), equal to the magnitude of the complex coefficient.
The Discrete Cosine Transform (DCT) is a relative of the Fourier transform which yields all real coefficients. A two-dimensional DCT is used by many image/video compression algorithms.

The discrete Fourier transform is fundamentally a transformation from a vector of complex numbers in the "time domain" to a vector of complex numbers in the "frequency domain" (I use quotes because if you apply the right scaling factors, the DFT is its own inverse). If your inputs are real, then you can perform two DFTs at once: Take the input vectors x and y and calculate F(x + i y). I forget how you separate the DFT afterwards, but I suspect it's something about symmetry and complex conjugates.
The discrete cosine transform sort-of lets you represent the "frequency domain" with the reals, and is common in lossy compression algorithms (JPEG, MP3). The surprising thing (to me) is that it works even though it appears to discard phase information, but this also seems to make it less useful for most signal processing purposes (I'm not aware of an easy way to do convolution/correlation with a DCT).
I've probably gotten some details wrong ;)

The way you've phrased this question, I believe you are looking for a more intuitive way of thinking rather than a mathematical answer. I come from a mechanical engineering background and this is how I think about the Fourier transform. I contextualize the Fourier transform with reference to a pendulum. If we have only the x-velocity vs time of a pendulum and we are asked to estimate the energy of the pendulum (or the forcing source of the pendulum), the Fourier transform gives a complete answer. As usually what we are observing is only the x-velocity, we might conclude that the pendulum only needs to be provided energy equivalent to its sinusoidal variation of kinetic energy. But the pendulum also has potential energy. This energy is 90 degrees out of phase with the potential energy. So to keep track of the potential energy, we are simply keeping track of the 90 degree out of phase part of the (kinetic)real component. The imaginary part may be thought of as a 'potential velocity' that represents a manifestation of the potential energy that the source must provide to force the oscillatory behaviour. What is helpful is that this can be easily extended to the electrical context where capacitors and inductors also store the energy in 'potential form'. If the signal is not sinusoidal of course the transform is trying to decompose it into sinusoids. This I see as assuming that the final signal was generated by combined action of infinite sources each with a distinct sinusoid behaviour. What we are trying to determine is a strength and phase of each source that creates the final observed signal at each time instant.
PS: 1) The last two statements is generally how I think of the Fourier transform itself.
2) I say potential velocity rather the potential energy as the transform usually does not change dimensions of the original signal or physical quantity so it cannot shift from representing velocity to energy.

Short answer
Why does FFT produce complex numbers instead of real numbers?
The reason FT result is a complex array is a complex exponential multiplier is involved in the coefficients calculation. The final result is therefore complex. FT uses the multiplier to correlate the signal against multiple frequencies. The principle is detailed further down.
Is it not possible to represent frequency domain in terms of real numbers only?
Of course the 1D array of complex coefficients returned by FT could be represented by a 2D array of real values, which can be either the Cartesian coordinates x and y, or the polar coordinates r and θ (more here). However...
Complex exponential form is the most suitable form for signal processing
Having only real data is not so useful.
On one hand it is already possible to get these coordinates using one of the functions real, imag, abs and angle.
On the other hand such isolated information is of very limited interest. E.g. if we add two signals with the same amplitude and frequency, but in phase opposition, the result is zero. But if we discard the phase information, we just double the signal, which is totally wrong.
Contrary to a common belief, the use of complex numbers is not because such a number is a handy container which can hold two independent values. It's because processing periodic signals involves trigonometry all the time, and there is a simple way to move from sines and cosines to more simple complex numbers algebra: Euler's formula.
So most of the time signals are just converted to their complex exponential form. E.g. a signal with frequency 10 Hz, amplitude 3 and phase π/4 radians:
can be described by x = 3.ei(2π.10.t+π/4).
splitting the exponent: x = 3.ei.π/4 times ei.2π.10.t, t being the time.
The first number is a constant called the phasor. A common compact form is 3∠π/4. The second number is a time-dependent variable called the carrier.
This signal 3.ei.π/4 times ei.2π.10.t is easily plotted, either as a cosine (real part) or a sine (imaginary part):
from numpy import arange, pi, e, real, imag
t = arange(0, 0.2, 1/200)
x = 3 * e ** (1j*pi/4) * e ** (1j*2*pi*10*t)
ax1.stem(t, real(x))
ax2.stem(t, imag(x))
Now if we look at FT coefficients, we see they are phasors, they don't embed the frequency which is only dependent on the number of samples and the sampling frequency.
Actually if we want to plot a FT component in the time domain, we have to separately create the carrier from the frequency found, e.g. by calling fftfreq. With the phasor and the carrier we have the spectral component.
A phasor is a vector, and a vector can turn
Cartesian coordinates are extracted by using real and imag functions, the phasor used above, 3.e(i.π/4), is also the complex number 2.12 + 2.12j (i is j for scientists and engineers). These coordinates can be plotted on a plane with the vertical axis representing i (left):
This point can also represent a vector (center). Polar coordinates can be used in place of Cartesian coordinates (right). Polar coordinates are extracted by abs and angle. It's clear this vector can also represent the phasor 3∠π/4 (short form for 3.e(i.π/4))
This reminder about vectors is to introduce how phasors are manipulated. Say we have a real number of amplitude 1, which is not less than a complex which angle is 0 and also a phasor (x∠0). We also have a second phasor (3∠π/4), and we want the product of the two phasors. We could compute the result using Cartesian coordinates with some trigonometry, but this is painful. The easiest way is to use the complex exponential form:
we just add the angles and multiply the real coefficients: 1.e(i.0) times 3.e(i.π/4) = 1x3.ei(0+π/4) = 3.e(i.π/4)
we can just write: (1∠0) times (3∠π/4) = (3∠π/4).
Whatever, the result is this one:
The practical effect is to turn the real number and scale its magnitude. In FT, the real is the sample amplitude, and the multiplier magnitude is actually 1, so this corresponds to this operation, but the result is the same:
This long introduction was to explain the math behind FT.
How spectral coefficients are created by FT
FT principle is, for each spectral coefficient to compute:
to multiply each of the samples amplitudes by a different phasor, so that the angle is increasing from the first sample to the last,
to sum all the previous products.
If there are N samples xn (0 to N-1), there are N spectral coefficients Xk to compute. Calculation of coefficient Xk involves multiplying each sample amplitude xn by the phasor e-i2πkn/N and taking the sum, according to FT equation:
In the N individual products, the multiplier angle varies according to 2π.n/N and k, meaning the angle changes, ignoring k for now, from 0 to 2π. So while performing the products, we multiply a variable real amplitude by a phasor which magnitude is 1 and angle is going from 0 to a full round. We know this multiplication turns and scales the real amplitude:
Source: A. Dieckmann from Physikalisches Institut der Universität Bonn
Doing this summation is actually trying to correlate the signal samples to the phasor angular velocity, which is how fast its angle varies with n/N. The result tells how strong this correlation is (amplitude), and how much synchroneous it is (phase).
This operation is repeated for the k spectral coefficients to compute (half with k negative, half with k positive). As k changes, the angle increment also varies, so the correlation is checked against another frequency.
Conclusion
FT results are neither sines nor cosines, they are not waves, they are phasors describing a correlation. A phasor is a constant, expressed as a complex exponential, embedding both amplitude and phase. Multiplied by a carrier, which is also a complex exponential, but variable, dependent on time, they draw helices in time domain:
Source
When these helices are projected onto the horizontal plane, this is done by taking the real part of the FT result, the function drawn is the cosine. When projected onto the vertical plane, which is done by taking the imaginary part of the FT result, the function drawn is the sine. The phase determines at which angle the helix starts and therefore without the phase, the signal cannot be reconstructed using an inverse FT.
The complex exponential multiplier is a tool to transform the linear velocity of amplitude variations into angular velocity, which is frequency times 2π. All that revolves around Euler's formula linking sinusoid and complex exponential.

For a signal with only cosine waves, fourier transform, aka. FFT produces completely real output. For a signal composed of only sine waves, it produces completely imaginary output. A phase shift in any of the signals will result in a mix of real and complex. Complex numbers (in this context) are merely another way to store phase and amplitude.

Nearest neighbors in high-dimensional data?

I have asked a question a few days back on how to find the nearest neighbors for a given vector. My vector is now 21 dimensions and before I proceed further, because I am not from the domain of Machine Learning nor Math, I am beginning to ask myself some fundamental questions:
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
Previously, I was suggested to use kd-Trees but the Wikipedia page clearly says that for high-dimensions, kd-Tree is almost equivalent to a brute-force search. In that case, what is the best way to find nearest-neighbors in a million point dataset efficiently?
Can someone please clarify the some (or all) of the above questions?

I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.
You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].
Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.
LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.
Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].
LSH has been applied seemingly everywhere. You may want to give it a try.
[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.
[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.
[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.
[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors", 2008.

I. The Distance Metric
First, the number of features (columns) in a data set is not a factor in selecting a distance metric for use in kNN. There are quite a few published studies directed to precisely this question, and the usual bases for comparison are:
the underlying statistical
distribution of your data;
the relationship among the features
that comprise your data (are they
independent--i.e., what does the
covariance matrix look like); and
the coordinate space from which your
data was obtained.
If you have no prior knowledge of the distribution(s) from which your data was sampled, at least one (well documented and thorough) study concludes that Euclidean distance is the best choice.
YEuclidean metric used in mega-scale Web Recommendation Engines as well as in current academic research. Distances calculated by Euclidean have intuitive meaning and the computation scales--i.e., Euclidean distance is calculated the same way, whether the two points are in two dimension or in twenty-two dimension space.
It has only failed for me a few times, each of those cases Euclidean distance failed because the underlying (cartesian) coordinate system was a poor choice. And you'll usually recognize this because for instance path lengths (distances) are no longer additive--e.g., when the metric space is a chessboard, Manhattan distance is better than Euclidean, likewise when the metric space is Earth and your distances are trans-continental flights, a distance metric suitable for a polar coordinate system is a good idea (e.g., London to Vienna is is 2.5 hours, Vienna to St. Petersburg is another 3 hrs, more or less in the same direction, yet London to St. Petersburg isn't 5.5 hours, instead, is a little over 3 hrs.)
But apart from those cases in which your data belongs in a non-cartesian coordinate system, the choice of distance metric is usually not material. (See this blog post from a CS student, comparing several distance metrics by examining their effect on kNN classifier--chi square give the best results, but the differences are not large; A more comprehensive study is in the academic paper, Comparative Study of Distance Functions for Nearest Neighbors--Mahalanobis (essentially Euclidean normalized by to account for dimension covariance) was the best in this study.
One important proviso: for distance metric calculations to be meaningful, you must re-scale your data--rarely is it possible to build a kNN model to generate accurate predictions without doing this. For instance, if you are building a kNN model to predict athletic performance, and your expectation variables are height (cm), weight (kg), bodyfat (%), and resting pulse (beats per minute), then a typical data point might look something like this: [ 180.4, 66.1, 11.3, 71 ]. Clearly the distance calculation will be dominated by height, while the contribution by bodyfat % will be almost negligible. Put another way, if instead, the data were reported differently, so that bodyweight was in grams rather than kilograms, then the original value of 86.1, would be 86,100, which would have a large effect on your results, which is exactly what you don't want. Probably the most common scaling technique is subtracting the mean and dividing by the standard deviation (mean and sd refer calculated separately for each column, or feature in that data set; X refers to an individual entry/cell within a data row):
X_new = (X_old - mu) / sigma
II. The Data Structure
If you are concerned about performance of the kd-tree structure, A Voronoi Tessellation is a conceptually simple container but that will drastically improve performance and scales better than kd-Trees.
This is not the most common way to persist kNN training data, though the application of VT for this purpose, as well as the consequent performance advantages, are well-documented (see e.g. this Microsoft Research report). The practical significance of this is that, provided you are using a 'mainstream' language (e.g., in the TIOBE Index) then you ought to find a library to perform VT. I know in Python and R, there are multiple options for each language (e.g., the voronoi package for R available on CRAN)
Using a VT for kNN works like this::
From your data, randomly select w points--these are your Voronoi centers. A Voronoi cell encapsulates all neighboring points that are nearest to each center. Imagine if you assign a different color to each of Voronoi centers, so that each point assigned to a given center is painted that color. As long as you have a sufficient density, doing this will nicely show the boundaries of each Voronoi center (as the boundary that separates two colors.
How to select the Voronoi Centers? I use two orthogonal guidelines. After random selecting the w points, calculate the VT for your training data. Next check the number of data points assigned to each Voronoi center--these values should be about the same (given uniform point density across your data space). In two dimensions, this would cause a VT with tiles of the same size.That's the first rule, here's the second. Select w by iteration--run your kNN algorithm with w as a variable parameter, and measure performance (time required to return a prediction by querying the VT).
So imagine you have one million data points..... If the points were persisted in an ordinary 2D data structure, or in a kd-tree, you would perform on average a couple million distance calculations for each new data points whose response variable you wish to predict. Of course, those calculations are performed on a single data set. With a V/T, the nearest-neighbor search is performed in two steps one after the other, against two different populations of data--first against the Voronoi centers, then once the nearest center is found, the points inside the cell corresponding to that center are searched to find the actual nearest neighbor (by successive distance calculations) Combined, these two look-ups are much faster than a single brute-force look-up. That's easy to see: for 1M data points, suppose you select 250 Voronoi centers to tesselate your data space. On average, each Voronoi cell will have 4,000 data points. So instead of performing on average 500,000 distance calculations (brute force), you perform far lesss, on average just 125 + 2,000.
III. Calculating the Result (the predicted response variable)
There are two steps to calculating the predicted value from a set of kNN training data. The first is identifying n, or the number of nearest neighbors to use for this calculation. The second is how to weight their contribution to the predicted value.
W/r/t the first component, you can determine the best value of n by solving an optimization problem (very similar to least squares optimization). That's the theory; in practice, most people just use n=3. In any event, it's simple to run your kNN algorithm over a set of test instances (to calculate predicted values) for n=1, n=2, n=3, etc. and plot the error as a function of n. If you just want a plausible value for n to get started, again, just use n = 3.
The second component is how to weight the contribution of each of the neighbors (assuming n > 1).
The simplest weighting technique is just multiplying each neighbor by a weighting coefficient, which is just the 1/(dist * K), or the inverse of the distance from that neighbor to the test instance often multiplied by some empirically derived constant, K. I am not a fan of this technique because it often over-weights the closest neighbors (and concomitantly under-weights the more distant ones); the significance of this is that a given prediction can be almost entirely dependent on a single neighbor, which in turn increases the algorithm's sensitivity to noise.
A must better weighting function, which substantially avoids this limitation is the gaussian function, which in python, looks like this:
def weight_gauss(dist, sig=2.0) :
return math.e**(-dist**2/(2*sig**2))
To calculate a predicted value using your kNN code, you would identify the n nearest neighbors to the data point whose response variable you wish to predict ('test instance'), then call the weight_gauss function, once for each of the n neighbors, passing in the distance between each neighbor the the test point.This function will return the weight for each neighbor, which is then used as that neighbor's coefficient in the weighted average calculation.

What you are facing is known as the curse of dimensionality. It is sometimes useful to run an algorithm like PCA or ICA to make sure that you really need all 21 dimensions and possibly find a linear transformation which would allow you to use less than 21 with approximately the same result quality.
Update:
I encountered them in a book called Biomedical Signal Processing by Rangayyan (I hope I remember it correctly). ICA is not a trivial technique, but it was developed by researchers in Finland and I think Matlab code for it is publicly available for download. PCA is a more widely used technique and I believe you should be able to find its R or other software implementation. PCA is performed by solving linear equations iteratively. I've done it too long ago to remember how. = )
The idea is that you break up your signals into independent eigenvectors (discrete eigenfunctions, really) and their eigenvalues, 21 in your case. Each eigenvalue shows the amount of contribution each eigenfunction provides to each of your measurements. If an eigenvalue is tiny, you can very closely represent the signals without using its corresponding eigenfunction at all, and that's how you get rid of a dimension.

Top answers are good but old, so I'd like to add up a 2016 answer.
As said, in a high dimensional space, the curse of dimensionality lurks around the corner, making the traditional approaches, such as the popular k-d tree, to be as slow as a brute force approach. As a result, we turn our interest in Approximate Nearest Neighbor Search (ANNS), which in favor of some accuracy, speedups the process. You get a good approximation of the exact NN, with a good propability.
Hot topics that might be worthy:
Modern approaches of LSH, such as Razenshteyn's.
RKD forest: Forest(s) of Randomized k-d trees (RKD), as described in FLANN,
or in a more recent approach I was part of, kd-GeRaF.
LOPQ which stands for Locally Optimized Product Quantization, as described here. It is very similar to the new Babenko+Lemptitsky's approach.
You can also check my relevant answers:
Two sets of high dimensional points: Find the nearest neighbour in the other set
Comparison of the runtime of Nearest Neighbor queries on different data structures
PCL kd-tree implementation extremely slow

To answer your questions one by one:
No, euclidean distance is a bad metric in high dimensional space. Basically in high dimensions, data points have large differences between each other. That decreases the relative difference in the distance between a given data point and its nearest and farthest neighbour.
Lot of papers/research are there in high dimension data, but most of the stuff requires a lot of mathematical sophistication.
KD tree is bad for high dimensional data ... avoid it by all means
Here is a nice paper to get you started in the right direction. "When in Nearest Neighbour meaningful?" by Beyer et all.
I work with text data of dimensions 20K and above. If you want some text related advice, I might be able to help you out.

Cosine similarity is a common way to compare high-dimension vectors. Note that since it's a similarity not a distance, you'd want to maximize it not minimize it. You can also use a domain-specific way to compare the data, for example if your data was DNA sequences, you could use a sequence similarity that takes into account probabilities of mutations, etc.
The number of nearest neighbors to use varies depending on the type of data, how much noise there is, etc. There are no general rules, you just have to find what works best for your specific data and problem by trying all values within a range. People have an intuitive understanding that the more data there is, the fewer neighbors you need. In a hypothetical situation where you have all possible data, you only need to look for the single nearest neighbor to classify.
The k Nearest Neighbor method is known to be computationally expensive. It's one of the main reasons people turn to other algorithms like support vector machines.

kd-trees indeed won't work very well on high-dimensional data. Because the pruning step no longer helps a lot, as the closest edge - a 1 dimensional deviation - will almost always be smaller than the full-dimensional deviation to the known nearest neighbors.
But furthermore, kd-trees only work well with Lp norms for all I know, and there is the distance concentration effect that makes distance based algorithms degrade with increasing dimensionality.
For further information, you may want to read up on the curse of dimensionality, and the various variants of it (there is more than one side to it!)
I'm not convinced there is a lot use to just blindly approximating Euclidean nearest neighbors e.g. using LSH or random projections. It may be necessary to use a much more fine tuned distance function in the first place!

A lot depends on why you want to know the nearest neighbors. You might look into the mean shift algorithm http://en.wikipedia.org/wiki/Mean-shift if what you really want is to find the modes of your data set.

I think cosine on tf-idf of boolean features would work well for most problems. That's because its time-proven heuristic used in many search engines like Lucene. Euclidean distance in my experience shows bad results for any text-like data. Selecting different weights and k-examples can be done with training data and brute-force parameter selection.

iDistance is probably the best for exact knn retrieval in high-dimensional data. You can view it as an approximate Voronoi tessalation.

I've experienced the same problem and can say the following.
Euclidean distance is a good distance metric, however it's computationally more expensive than the Manhattan distance, and sometimes yields slightly poorer results, thus, I'd choose the later.
The value of k can be found empirically. You can try different values and check the resulting ROC curves or some other precision/recall measure in order to find an acceptable value.
Both Euclidean and Manhattan distances respect the Triangle inequality, thus you can use them in metric trees. Indeed, KD-trees have their performance severely degraded when the data have more than 10 dimensions (I've experienced that problem myself). I found VP-trees to be a better option.

KD Trees work fine for 21 dimensions, if you quit early,
after looking at say 5 % of all the points.
FLANN does this (and other speedups)
to match 128-dim SIFT vectors. (Unfortunately FLANN does only the Euclidean metric,
and the fast and solid
scipy.spatial.cKDTree
does only Lp metrics;
these may or may not be adequate for your data.)
There is of course a speed-accuracy tradeoff here.
(If you could describe your Ndata, Nquery, data distribution,
that might help people to try similar data.)
Added 26 April, run times for cKDTree with cutoff on my old mac ppc, to give a very rough idea of feasibility:
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=1000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.1 % of the 1000000 points, 0.31 % of 188315 boxes; better 0.0042 0.014 0.1 %
3.5 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.253
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=5000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.48 % of the 1000000 points, 1.1 % of 188315 boxes; better 0.0071 0.026 0.5 %
15 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.245

You could try a z order curve. It's easy for 3 dimension.

I had a similar question a while back. For fast Approximate Nearest Neighbor Search you can use the annoy library from spotify: https://github.com/spotify/annoy
This is some example code for the Python API, which is optimized in C++.
from annoy import AnnoyIndex
import random
f = 40
t = AnnoyIndex(f, 'angular') # Length of item vector that will be indexed
for i in range(1000):
v = [random.gauss(0, 1) for z in range(f)]
t.add_item(i, v)
t.build(10) # 10 trees
t.save('test.ann')
# ...
u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors
They provide different distance measurements. Which distance measurement you want to apply depends highly on your individual problem. Also consider prescaling (meaning weighting) certain dimensions for importance first. Those dimension or feature importance weights might be calculated by something like entropy loss or if you have a supervised learning problem gini impurity gain or mean average loss, where you check how much worse your machine learning model performs, if you scramble this dimensions values.
Often the direction of the vector is more important than it's absolute value. For example in the semantic analysis of text documents, where we want document vectors to be close when their semantics are similar, not their lengths. Thus we can either normalize those vectors to unit length or use angular distance (i.e. cosine similarity) as a distance measurement.
Hope this is helpful.

Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
I would suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are calculated to find the most relevant dimensions. You can use these weights when using euclidean distance, for example. See curse of dimensionality for common problems and also this article can enlighten you somehow:
A k-means type clustering algorithm for subspace clustering of mixed numeric and
categorical datasets

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio