Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate at running competitions of 10 km and 20 km with hilly courses i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
         Comp1    Comp2    Comp3
User1    20min    ??       10min
User2    25min    20min    12min
User3    30min    25min    ??
User4    30min    ??       ??
I would like to complete the matrix above which has the size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of distributions). Moreover, the correlation between the competitions is very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?

Your astute observation that this is a matrix completion problem gets
you most of the way to the solution. I'll codify your intuition that
the combination of ability of a user and difficulty of the course
yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users so that u_i is user i's
speed. Let the vector v denote the difficulty of the courses so
that v_j is course j's difficulty. Also when available, let t_ij be user i's time on
course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible
model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the
prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It
minimizes the above cost function by gradient descent. The gradients of
f with respect to u and v are
df/du_i = 2 * sum_j (u_i * v_j - y_ij) v_j
df/dv_j = 2 * sum_i (u_i * v_j - y_ij) u_i
where each sum runs only over the observed entries.
Plug these gradients into a Conjugate Gradient solver or BFGS
optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or
optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
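As a rough illustration, the cost and its gradient could be coded as below. This is only a sketch: the dictionary `observed`, the toy speeds and the starting values are made up for the example, and the factor of 2 comes from differentiating the square. The finite-difference helper is there to check the gradient before handing both functions to an optimizer (after flattening u and v into one parameter vector).

```python
def cost_and_gradient(u, v, observed):
    """Squared error over the observed speeds, plus its gradient.

    observed maps (i, j) -> y_ij, the measured speed of user i on course j.
    """
    cost = 0.0
    grad_u = [0.0] * len(u)
    grad_v = [0.0] * len(v)
    for (i, j), y in observed.items():
        residual = u[i] * v[j] - y
        cost += residual ** 2
        grad_u[i] += 2.0 * residual * v[j]
        grad_v[j] += 2.0 * residual * u[i]
    return cost, grad_u, grad_v


def numeric_gradient_u(u, v, observed, i, eps=1e-6):
    """Central finite difference of the cost with respect to u[i]."""
    up = list(u)
    up[i] += eps
    down = list(u)
    down[i] -= eps
    cost_up = cost_and_gradient(up, v, observed)[0]
    cost_down = cost_and_gradient(down, v, observed)[0]
    return (cost_up - cost_down) / (2 * eps)


# Hypothetical toy data: speeds (1/minutes) of 3 users on 2 courses.
observed = {(0, 0): 1 / 20, (1, 0): 1 / 25, (1, 1): 1 / 20, (2, 1): 1 / 25}
u = [1.0, 0.8, 0.6]    # arbitrary starting guess for user speeds
v = [0.05, 0.06]       # arbitrary starting guess for course factors
cost, grad_u, grad_v = cost_and_gradient(u, v, observed)
```

Only observed (i, j) pairs contribute, so missing race results never enter the cost.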
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been
proposed. The resulting algorithms are just as simple to code up and seem to
work very well. Check out, for example Collaborative Filtering in a Non-Uniform World:
Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.

There are several ways to do this, perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step normalize your data into a uniform function with 0 mean and 1 std deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the std deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is your loss function is:
total_error = 0
forall i,j
    if (Participant i participated in Race j)
        actual = ActualRaceResult(i,j)
        predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
        total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term on the loss function, for example square of the lengths of the feature vectors)
Let me know if this architecture is clear to you or you need further elaboration.
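To make the architecture a bit more concrete, here is a minimal sketch in plain Python. The function names, the toy `results` dictionary (already normalized values), the learning rate and the regularization weight are all assumptions chosen for illustration, and plain gradient steps stand in for whatever optimizer you end up using.

```python
import random


def predict(p_vec, r_vec):
    """Scalar product of a participant vector and a race vector."""
    return sum(a * b for a, b in zip(p_vec, r_vec))


def total_loss(P, R, results, reg):
    """Squared error on known results plus an L2 penalty on all features."""
    error = sum((y - predict(P[i], R[j])) ** 2 for (i, j), y in results.items())
    penalty = sum(x * x for vec in P + R for x in vec)
    return error + reg * penalty


def improve(targets, others, results, reg, lr, target_is_participant):
    """One gradient step on `targets` while `others` is held fixed."""
    grads = [[2.0 * reg * x for x in vec] for vec in targets]
    for (i, j), y in results.items():
        t, o = (i, j) if target_is_participant else (j, i)
        err = predict(targets[t], others[o]) - y
        for k in range(len(grads[t])):
            grads[t][k] += 2.0 * err * others[o][k]
    for vec, grad in zip(targets, grads):
        for k in range(len(vec)):
            vec[k] -= lr * grad[k]


def fit(results, num_participants, num_races, n=2, reg=0.01, lr=0.05,
        epochs=200, seed=0):
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(num_participants)]
    R = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(num_races)]
    history = [total_loss(P, R, results, reg)]
    for _ in range(epochs):
        # Alternate: participants first, then races.
        improve(P, R, results, reg, lr, target_is_participant=True)
        improve(R, P, results, reg, lr, target_is_participant=False)
        history.append(total_loss(P, R, results, reg))
    return P, R, history


# Hypothetical normalized results: (participant, race) -> value.
results = {(0, 0): -1.0, (0, 2): -0.8, (1, 0): 0.0, (1, 1): 0.1,
           (1, 2): 0.2, (2, 0): 1.0, (2, 1): 0.9, (3, 0): 1.0}
P, R, history = fit(results, num_participants=4, num_races=3)
missing = predict(P[0], R[1])  # estimate for participant 0 in race 1
```

The hyperparameter N is the `n` argument here; in practice you would pick it, together with `reg`, on a held-out set of known results.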

I think this is a classical task of missing data recovery. There exist some different methods. One of them which I can suggest is based on Self Organizing Feature Map (Kohonen's Map).
Below it's assumed that every athlete's record is a pattern, and every competition's data is a feature.
Basically, you should divide your data into 2 sets: first, the fully defined patterns, and second, the patterns with partially lost features. I assume this is reasonable because sparsity is 8%, that is, you have enough data (92%) to train the net on undamaged records.
Then you feed first set to the SOM and train it on this data. During this process all features are used. I'll not copy algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate the best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU its weights corresponding to the missing features.
As an alternative, you could skip dividing the data into 2 sets and train the net on all patterns, including the ones with missing features. But for such patterns the learning process should be altered in a similar way, that is, the BMU should be calculated only on the existing features in every pattern.
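The recall step could look roughly like this. It is a sketch only: the map is assumed to be trained already, `trained_units` holds invented weights, and missing features are marked with `None` by convention of this example.

```python
import math


def best_matching_unit(pattern, units):
    """Index of the unit closest to `pattern`, using only present features.

    Missing features are marked with None and skipped in the distance.
    """
    def partial_distance(unit):
        return math.sqrt(sum((x - w) ** 2
                             for x, w in zip(pattern, unit) if x is not None))
    return min(range(len(units)), key=lambda idx: partial_distance(units[idx]))


def impute(pattern, units):
    """Fill each missing feature with the corresponding weight of the BMU."""
    bmu = units[best_matching_unit(pattern, units)]
    return [w if x is None else x for x, w in zip(pattern, bmu)]


# Hypothetical weights of an already trained map with three units;
# each unit holds finishing times (minutes) for Comp1, Comp2, Comp3.
trained_units = [
    [20.0, 16.0, 10.0],
    [25.0, 20.0, 12.0],
    [30.0, 25.0, 15.0],
]
user1 = [20.0, None, 10.0]
completed = impute(user1, trained_units)
```

The same `best_matching_unit` function could be reused during training if you go for the second variant, where incomplete patterns also take part in learning.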

I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This algorithm minimizes the rank of your matrix M. P in the constraint is an operator that takes the known terms of your matrix M' and constrains those terms in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex envelope, the nuclear norm ||M||_*. Then you trade off the two terms by controlling the parameter lambda.

Related

Inverse inference on bayesian piece wise linear regression model in pymc

I am trying to perform inverse inference on a simple Bayesian network for piecewise linear regression. That is, y is a piecewise linear function of x (figure: plot of Y vs X),
and the Bayesian network looks like this (figure: Bayesian network model).
Here, X has a normal distribution, K is a discrete node that has a softmax distribution conditioned on X and Y is a mixture of linear gaussians based on the value of K (i.e. Pr(Y | K=i, X=x) ~ N(mu=w_i*x+b_i, s_i)).
I have learned the parameters of this model using EM algorithm. (The actual relationship of Y and X has five linear pieces, but I have learnt using 8 levels for the discrete node). And formed the pymc model using those parameters. Here is the code:
import numpy as np
import pymc

x = pymc.Normal('x', mu=0.5, tau=1.0/0.095)
# The probabilities of the discrete node given x=x; softmax distribution
epower = [-11.818,54.450,29.270,-13.038,73.541,28.466,-57.530,-101.568]
bias = [7.8228,-35.3859,-12.9512,12.8004,-48.1097,-13.2229,30.6079,39.3811]
@pymc.deterministic(plot=False)
def prob(epower=epower, bias=bias, x=x):
    pr = [np.exp(ep*x + bb) for ep, bb in zip(epower, bias)]
    return [pri/np.sum(pr) for pri in pr]
knode = pymc.Categorical('knode', p=prob)
# The weights of the regression
wtsY = [15.022, -70.000, -14.996, 15.026, -70.000, -14.996, 34.937, 15.027]
# The unconditional means of Y
meansY = [5.9881,68.0000,23.9973,5.9861,68.0000,23.9972,-1.9809,1.9982]
sigmasY = [0.010189,0.010000,0.010033,0.010211,0.010000,0.010036,0.010380,0.010167]
@pymc.deterministic(plot=False)
def condmeanY(knode=knode, x=x, wtsY=wtsY, meansY=meansY):
    return wtsY[knode]*x + meansY[knode]
@pymc.deterministic(plot=False)
def condsigmaY(knode=knode, sigmasY=sigmasY):
    return sigmasY[knode]
y = pymc.Normal('y', mu=condmeanY, tau=1.0/condsigmaY, value=13.5, observed=True)
I want to predict x, when y is observed (inverse inference). As y is (approximately) non-linear in x, there will be multiple solutions for a given value of y. I expect that the obtained trace of x should show those multiple solutions. I have ensured that autocorrelation is very low (sample=2000, burn=1000). But I am not able to see multiple solutions. In the above example, for y=13.5, there are two possible solutions, x=0.5 and x=0.7. But the chain only wanders near 0.5. The histogram has only one peak, at 0.5.
Am I missing something?
EDIT: I came across this very relevant question:Solving inverse problems with PyMC. What I learned from the answer is that the prior of x, which I am assuming to be uni-modal Gaussian here, should have a non-parametric distribution and then the obtained samples after first iteration can be used to update it. Kernel density estimation (with gaussian kernel) has been suggested to obtain non-parametric stochastic from data. I incorporated this in my model but still there is no difference. One thing I noted is that if I do the inference multiple times, approx 50% of the times, I get 0.5, and 50% of the times, I get 0.7 (I am not sure if this was the case earlier as well, because I had not run that model many times to observe this.) But still, should I not see two peaks in the trace after first iteration only?
I also tried with a modified version of this model, where the edge from X to K is reversed. This is a classical conditional linear Gaussian model. Even with this model, I could not get multiple solutions visible in the trace. I am sort of stuck here. Please help.

What algorithm do I use to calculate voltage across a combination circuit?

I'm trying to programmatically calculate voltage changes over a very large circuit.
*This question may seem geared toward electronics, but it's
more about applying an algorithm over a set of data.
To keep things simple,
here is a complete circuit, with the voltages already calculated:
I'm originally only given the battery voltage and the resistances:
The issue I have is that voltage is calculated differently among parallel and series circuits.
A somewhat similar question asked on SO.
Some formulas:
When resistors are in parallel:
Rtotal = 1/(1/R1 + 1/R2 + 1/R3 ... + 1/Rn)
When resistors are in series:
Rtotal = R1 + R2 + R3 ... + Rn
Ohm's Law:
V = IR
I = V/R
R = V/I
V is voltage (volts)
I is current (amps)
R is resistance(ohms)
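For reference, the formulas above translate directly into code. The circuit values below (a 12 V battery, a 4 ohm resistor in series with 6 ohm and 3 ohm resistors in parallel) are made up for this example and are not the circuit from the pictures.

```python
def series(*resistances):
    """Total resistance of resistors connected in series."""
    return sum(resistances)


def parallel(*resistances):
    """Total resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


# Hypothetical circuit: R1 in series with (R2 parallel to R3).
battery_voltage = 12.0
r1, r2, r3 = 4.0, 6.0, 3.0

r_total = series(r1, parallel(r2, r3))
current = battery_voltage / r_total          # I = V/R
drop_r1 = current * r1                       # V = IR across R1
drop_parallel = current * parallel(r2, r3)   # shared by R2 and R3
```

This is exactly the manual grouping approach from the tutorials, which is why it does not generalize to arbitrary graphs.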
Every tutorial I've found on the internet consists of people conceptually grouping together parallel circuits to get the total resistance, and then using that resistance to calculate the resistance in series.
This is fine for small examples, but it's difficult to derive an algorithm out of it for large scale circuits.
My question:
Given a matrix of all complete paths,
is there a way for me to calculate all the voltage drops?
I currently have the system as a graph data structure.
All of the nodes are represented by (and can be looked up by) an id number.
So for the example above, if I run the traversals, I'll get back a list of paths like this:
[[0,1,2,4,0]
,[0,1,3,4,0]]
Each number can be used to derive the actual node and its corresponding data. What kind of transformations/algorithms do I need to perform on this set of data?
It's very likely that portions of the circuit will be compound, and those compound sections may find themselves being in parallel or series with other compound sections.
I think my problem is akin to this:
http://en.wikipedia.org/wiki/Series-parallel_partial_order

Some circuits cannot even be analyzed in terms of series and parallel, for example a circuit which includes the edges of a cube (there's some code at the bottom of that web page that might be helpful; I haven't looked at it). Another example that can't be analyzed into series/parallel is a pentagon/pentagram shape.
A more robust solution than thinking about series and parallel is to use Kirchhoff's laws.
1. Make variables for the currents in each linear section of the circuit.
2. Apply Kirchhoff's current law (KCL) to nodes where linear sections meet.
3. Apply Kirchhoff's voltage law (KVL) to as many cycles as you can find.
4. Use Gaussian elimination to solve the resulting linear system of equations.
The tricky part is identifying cycles. In the example you give, there are three cycles: through battery and left resistor, battery and right resistor, and through left and right resistors. For planar circuits it's not too hard to find a complete set of cycles; for three dimensional circuits, it can be hard.
You don't actually need all the cycles. In the above example, two would be enough (corresponding to the two bounded regions into which the circuit divides the plane). Then you have three variables (currents in three linear parts of the circuit) and three equations (sum of currents at the top node where three linear segments meet, and voltage drops around two cycles). That is enough to solve the system for currents by Gaussian elimination, then you can calculate voltages from the currents.
If you throw in too many equations (e.g., currents at both nodes in your example, and voltages over three cycles instead of two), things will still work out: Gaussian elimination will just eliminate the redundancies and you'll still get the unique, correct answer. The real problem is if you have too few equations. For example, if you use KCL on the two nodes in your example and KVL around just one cycle, you'll have three equations, but one is redundant, so you'll only really have two independent equations, which is not enough. So I would say throw in every equation you can find and let Gaussian elimination sort it out.
And hopefully you can restrict to planar circuits, for which it is easy to find a nice set of cycles. Otherwise you'll need a graph cycle enumeration algorithm. I'm sure you can find one if you need it.
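Here is a small sketch of that procedure for a battery with two resistors in parallel. The component values (12 V, 4 ohm, 6 ohm) are assumed for illustration, the unknowns are ordered as [i_battery, i_left, i_right], and exact fractions are used so the result is not blurred by rounding. The elimination routine tolerates redundant equations and complains when there are too few independent ones; it does not check for contradictory equations.

```python
from fractions import Fraction


def gaussian_elimination(rows):
    """Solve a linear system given as augmented rows [a1, ..., an, b].

    Redundant equations are allowed. Raises ValueError if the equations
    do not determine every unknown.
    """
    rows = [[Fraction(x) for x in row] for row in rows]
    n = len(rows[0]) - 1
    pivot_row = 0
    for col in range(n):
        pivot = next((r for r in range(pivot_row, len(rows))
                      if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("too few independent equations")
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        lead = rows[pivot_row][col]
        rows[pivot_row] = [x / lead for x in rows[pivot_row]]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b
                           for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return [rows[i][n] for i in range(n)]


V, R_LEFT, R_RIGHT = 12, 4, 6  # hypothetical values

equations = [
    [1, -1, -1, 0],            # KCL at the top node
    [0, R_LEFT, 0, V],         # KVL: battery and left resistor
    [0, 0, R_RIGHT, V],        # KVL: battery and right resistor
    [0, R_LEFT, -R_RIGHT, 0],  # KVL: left and right resistors (redundant)
]
i_battery, i_left, i_right = gaussian_elimination(equations)
voltage_left = i_left * R_LEFT
```

Feeding it KCL at both nodes plus a single KVL equation raises the `ValueError`, which is the "too few equations" case described above.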

Use a maximum flow algorithm (Dijkstra is your friend).
http://www.cs.princeton.edu/courses/archive/spr04/cos226/lectures/maxflow.4up.pdf
You pretend to be in front of a water flow problem (well, actually it IS a flow problem). You have to compute the flow of water on each segment (the current). Then you can easily compute the voltage drop (water pressure) across every resistor.

I think the way to go here would be something like this:
1. Sort all your paths into groups of the same length.
2. While there is more than one group, choose the group with the largest length and:
2a. Find two paths with one item difference.
2b. "Merge" them into a path with the length smaller by one - the merge is dependent on the actual items that are different.
2c. Add the new path into the relevant group.
2d. If there are only paths with more than one item difference, merge the different items so that you have only one different item between the paths.
2e. When there is only one item left, find an item from a "lower" group (= length is smaller) with minimum differences, and merge the item to match.
3. When there is one group left with more than one item, keep doing #2 until there is one group left with one item.
4. Calculate the value of that item directly.
This is very initial, but I think the main idea is clear.
Any improvements are welcome.

Single Pass Seed Selection Algorithm for k-Means

I've recently read the Single Pass Seed Selection Algorithm for k-Means article, but I don't really understand the algorithm, which is:
1. Calculate distance matrix Dist in which Dist (i,j) represents distance from i to j
2. Find Sumv in which Sumv (i) is the sum of the distances from ith point to all other points.
3. Find the point i which is min (Sumv) and set Index = i
4. Add First to C as the first centroid
5. For each point xi, set D (xi) to be the distance between xi and the nearest point in C
6. Find y as the sum of distances of first n/k nearest points from the Index
7. Find the unique integer i so that D(x1)^2+D(x2)^2+...+D(xi)^2 >= y > D(x1)^2+D(x2)^2+...+D(x(i-1))^2
8. Add xi to C
9. Repeat steps 5-8 until k centers
Especially step 6, do we still use the same Index (same point) over and over or we use the newly added point from C? And about step 8, does i have to be larger than 1?

Honestly, I wouldn't worry about understanding that paper - it's not very good.
The algorithm is poorly described.
It's not actually a single pass; it needs to do n^2/2 pairwise computations plus one additional pass through the data.
They don't report the runtime of their seed selection scheme, probably because doing O(n^2) work makes it very slow.
They are evaluating on very simple data sets that don't have a lot of bad solutions for k-Means to fall into.
One of their metrics of "better"ness is how many iterations it takes k-means to run given the seed selection. While it is an interesting metric, the small differences they report are meaningless (k-means++ seeding could be more iterations, but less work done per iteration), and they don't report the run time or which k-means algorithm they use.
You will get a lot more benefit from learning and understanding the k-means++ algorithm they are comparing against, and reading some of the history from that.
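For orientation, the seeding step of k-means++ can be sketched in a few lines. This is a minimal version written for this answer, not code from either paper; the toy points are invented, and it assumes the data contains at least k distinct points.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centers by D^2-weighted sampling (k-means++)."""
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its nearest center.
        weights = [min(squared_distance(p, c) for c in centers)
                   for p in points]
        # Points far from every chosen center are more likely to be picked;
        # already chosen points have weight 0 and cannot be picked again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical data: two well separated groups of 2D points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
seeds = kmeans_pp_seeds(points, k=2, rng=random.Random(0))
```

Note the randomness: different seeds for the random generator give different initial centers, which is the property the paper in question tries to remove.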
If you really want to understand what they are doing, I would brush up on your MATLAB and read their provided MATLAB code. But it's not really worth it. If you look up the quantile seed selection algorithm, they are essentially doing something very similar. Instead of using the distance to the first seed to sort the points, they appear to be using the sum of pairwise distances (which means they don't need an initial seed, hence the unique solution).

The Single Pass Seed Selection (SPSS) algorithm is a novel algorithm. "Single pass" means that the first seed can be selected without any iterations. The performance of k-means++ depends on the first seed; SPSS overcomes this. Please go through the paper "Robust Seed Selection Algorithm for k-means" from the same authors.
John J. Louis

Graph Theory: Calculating Clustering Coefficient

I'm doing some research and I've come to a point where I have to calculate the clustering coefficient of a graph.
According to this paper directly related to my research:
The clustering coefficient C(p) is
defined as follows. Suppose that a
vertex v has kv neighbours; then at
most (kv * (kv-1)) / 2 edges can
exist between them (this occurs when
every neighbour of v is connected to
every other neighbour of v). Let Cv
denote the fraction of these allowable
edges that actually exist. Define C as
the average of Cv over all v
But this wikipedia article on the subject says differently:
C = (number of closed triplets) / (number of connected triples)
It seems to me that the latter is more computationally expensive.
So really my question is: are they equivalent?
It should be noted that the paper is cited by the Wikipedia article.
Thanks for your time.

The two formulas are not the same; they are two different ways in which the global clustering coefficient can be calculated.
One way is by averaging the clustering coefficients (C_i [1]) of all nodes (this is the method you quoted from Watts and Strogatz). However, in [2, p204] Newman argues that this method is less preferable than the second one (the one you got from wikipedia). He justifies by pointing how the value of the global clustering coeff can be dominated by nodes of low degree, due to C_i's denominator [1]. So, in a network with many nodes of low degrees, you end up with a large value for the global clustering coeff, which Newman argues would be unrepresentative.
However, many network studies (or, in my experience, at least many studies concerned with online social networks) seem to have used this method, so in order to be able to compare your results with theirs, you would require to use the same method. Furthermore, the critique raised by Newman does not affect the extent to which comparisons of global clustering coefficients can be made, provided the same method was used in measuring them.
The two formulae are different and were proposed at different moments in time. The one you quoted from Watts and Strogatz is older, which is perhaps why it seems to have been more commonly used. Newman also explains that the two formulae are far from equivalent, and shouldn't be used as such. He says they can give substantially different numbers for a given network, but doesn't explain why.
[1] C_i = (number of pairs of neighbours of i that are connected) / (number of pairs of neighbours of i)
[2] Newman, M.E.J.. Networks : an introduction. Oxford New York: Oxford University Press, 2010. Print.
Edit:
I am including here a series of calculations for the same ER random graph. You can see how the two methods give different results, even for undirected graphs. (done using Mathematica)
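A tiny hand-checkable example shows the same thing without Mathematica. The graph below (a triangle 0-1-2 with one extra vertex 3 attached to vertex 0) is chosen just for illustration, and vertices with fewer than two neighbours are given a local coefficient of 0, which is one common convention.

```python
from fractions import Fraction
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are themselves connected."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return Fraction(0)  # convention assumed for this sketch
    links = sum(1 for a, b in combinations(sorted(neighbours), 2)
                if b in adj[a])
    return Fraction(links, k * (k - 1) // 2)


def average_clustering(adj):
    """Watts-Strogatz style: mean of the local coefficients."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def transitivity(adj):
    """Closed triplets divided by connected triples."""
    closed = 0   # connected triples whose end points are also linked
    triples = 0  # all connected triples, counted at their middle vertex
    for v, neighbours in adj.items():
        for a, b in combinations(sorted(neighbours), 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return Fraction(closed, triples)


# Undirected graph as adjacency sets: triangle 0-1-2 plus edge 0-3.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
avg_cc = average_clustering(adj)   # (1/3 + 1 + 1 + 0) / 4 = 7/12
global_cc = transitivity(adj)      # 3 closed triplets / 5 triples = 3/5
```

Both are computed in one pass over each vertex's neighbour pairs, so neither is noticeably more expensive than the other in this form.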

I think they're equivalent. The wiki page you link to gives a proof that the triples formulation is equivalent to the fraction of possible edges formulation when calculating the local clustering coefficient, i.e. calculated just at a vertex. From there it seems that you just need to show that
sum_v lambda(v)/tau(v) = 3 x # triangles / # connected triples
where lambda(v) is the number of triangles containing v, and tau(v) is the number of connected triples for which v is the middle vertex, i.e. adjacent to each of the other 2 edges.
Now each triangle gets counted three times in the numerator of the LHS. However, each connected triple is only counted once for the middle vertex on the LHS, so the denominators are the same.

I partially disagree with Whatang. These methods are only equivalent for undirected graphs; for directed graphs they return different results. In my opinion the local clustering coefficient method is the correct one, not to mention that it's less computationally expensive. For example:
<-----
4 -----> 5
|<--||-->
| ||
|-> 6 -> 7
4(IN [5,6], OUT [5,6])
5(IN [4,6], OUT [4])
6(IN [4], OUT [4,5,7])
7(IN [6], OUT [])
central = 6
localCC = 2 / (4*3) = 1/6
globalCC = 1 / 3

I wouldn't trust that wikipedia article. The first formula you cited is currently defined as the Mean Clustering Coefficient, hence it is the mean of all local clustering coefficients for a graph g. This is in no way the same as the global clustering coefficient, as xk_id aptly put it.

There is a great page to learn the basics from!
http://www.learner.org/courses/mathilluminated/interactives/network/
all about cluster coefficients, small world and so on ...

Nearest neighbors in high-dimensional data?

I have asked a question a few days back on how to find the nearest neighbors for a given vector. My vector is now 21 dimensions and before I proceed further, because I am not from the domain of Machine Learning nor Math, I am beginning to ask myself some fundamental questions:
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
Previously, I was suggested to use kd-Trees but the Wikipedia page clearly says that for high-dimensions, kd-Tree is almost equivalent to a brute-force search. In that case, what is the best way to find nearest-neighbors in a million point dataset efficiently?
Can someone please clarify some (or all) of the above questions?

I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.
You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].
Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.
LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.
Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].
LSH has been applied seemingly everywhere. You may want to give it a try.
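To give a feel for how little code is involved, here is a bare-bones sketch of the p-stable scheme from [1] for Euclidean distance, where each hash value is floor((a . x + b) / w) with Gaussian a and uniform b. The class name, the parameter values and the single hash table are simplifications chosen for this example; real systems use several tables and tune the bucket width.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """One hash table keyed by several Gaussian random projections."""

    def __init__(self, dim, num_projections=4, bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = bucket_width
        self.directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                           for _ in range(num_projections)]
        self.offsets = [rng.uniform(0.0, bucket_width)
                        for _ in range(num_projections)]
        self.table = defaultdict(list)

    def hash(self, point):
        """Tuple of bucket indices, one per random projection."""
        return tuple(
            math.floor((sum(a * x for a, x in zip(direction, point)) + b)
                       / self.width)
            for direction, b in zip(self.directions, self.offsets)
        )

    def add(self, point):
        self.table[self.hash(point)].append(point)

    def candidates(self, query):
        """Points that landed in the same bin as the query."""
        return self.table.get(self.hash(query), [])


# Hypothetical database of 21-dimensional points.
data_rng = random.Random(1)
points = [tuple(data_rng.uniform(0.0, 1.0) for _ in range(21))
          for _ in range(200)]
index = EuclideanLSH(dim=21)
for p in points:
    index.add(p)
near = index.candidates(points[0])
```

The candidates still have to be ranked by true distance afterwards; the hash only narrows down which points are worth comparing.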
[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.
[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.
[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.
[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors", 2008.

I. The Distance Metric
First, the number of features (columns) in a data set is not a factor in selecting a distance metric for use in kNN. There are quite a few published studies directed to precisely this question, and the usual bases for comparison are:
the underlying statistical distribution of your data;
the relationship among the features that comprise your data (are they independent, i.e., what does the covariance matrix look like); and
the coordinate space from which your data was obtained.
If you have no prior knowledge of the distribution(s) from which your data was sampled, at least one (well documented and thorough) study concludes that Euclidean distance is the best choice.
The Euclidean metric is used in mega-scale Web recommendation engines as well as in current academic research. Distances calculated by Euclidean have intuitive meaning and the computation scales, i.e., Euclidean distance is calculated the same way whether the two points are in two-dimensional or in twenty-two-dimensional space.
It has only failed for me a few times, and in each of those cases Euclidean distance failed because the underlying (Cartesian) coordinate system was a poor choice. You'll usually recognize this because, for instance, path lengths (distances) are no longer additive. For example, when the metric space is a chessboard, Manhattan distance is better than Euclidean; likewise, when the metric space is Earth and your distances are trans-continental flights, a distance metric suitable for a polar coordinate system is a good idea (e.g., London to Vienna is 2.5 hours, Vienna to St. Petersburg is another 3 hours, more or less in the same direction, yet London to St. Petersburg isn't 5.5 hours but a little over 3 hours).
But apart from those cases in which your data belongs in a non-Cartesian coordinate system, the choice of distance metric is usually not material. (See this blog post from a CS student, comparing several distance metrics by examining their effect on a kNN classifier: chi square gives the best results, but the differences are not large. A more comprehensive study is in the academic paper Comparative Study of Distance Functions for Nearest Neighbors; Mahalanobis (essentially Euclidean normalized to account for dimension covariance) was the best in this study.)
One important proviso: for distance metric calculations to be meaningful, you must re-scale your data; rarely is it possible to build a kNN model that generates accurate predictions without doing this. For instance, if you are building a kNN model to predict athletic performance, and your explanatory variables are height (cm), weight (kg), bodyfat (%), and resting pulse (beats per minute), then a typical data point might look something like this: [ 180.4, 66.1, 11.3, 71 ]. Clearly the distance calculation will be dominated by height, while the contribution by bodyfat % will be almost negligible. Put another way, if the data were reported differently, so that bodyweight was in grams rather than kilograms, then the original value of 66.1 would be 66,100, which would have a large effect on your results, which is exactly what you don't want. Probably the most common scaling technique is subtracting the mean and dividing by the standard deviation (mean and sd are calculated separately for each column, or feature, in the data set; X refers to an individual entry/cell within a data row):
X_new = (X_old - mu) / sigma
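In code, the re-scaling is only a few lines. The sketch below uses the standard library; the first row is the data point from the example above, the other two rows are invented so there is something to average over, and it assumes no column is constant (otherwise sigma would be zero).

```python
import statistics


def standardize(rows):
    """Rescale every column to zero mean and unit (sample) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)]
            for row in rows]


# Columns: height (cm), weight (kg), bodyfat (%), resting pulse (bpm).
rows = [
    [180.4, 66.1, 11.3, 71.0],
    [172.0, 80.5, 18.2, 64.0],   # hypothetical athlete
    [190.1, 92.3, 14.0, 58.0],   # hypothetical athlete
]
scaled = standardize(rows)
```

Remember to apply the same means and standard deviations (computed on the training data) to any new point before measuring distances.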
II. The Data Structure
If you are concerned about the performance of the kd-tree structure, a Voronoi tessellation (VT) is a conceptually simple container that will drastically improve performance and that scales better than kd-trees.
This is not the most common way to persist kNN training data, though the application of VT for this purpose, as well as the consequent performance advantages, are well-documented (see e.g. this Microsoft Research report). The practical significance of this is that, provided you are using a 'mainstream' language (e.g., in the TIOBE Index) then you ought to find a library to perform VT. I know in Python and R, there are multiple options for each language (e.g., the voronoi package for R available on CRAN)
Using a VT for kNN works like this:
From your data, randomly select w points; these are your Voronoi centers. A Voronoi cell encapsulates all neighboring points that are nearest to its center. Imagine assigning a different color to each of the Voronoi centers, so that each point assigned to a given center is painted that color. As long as you have a sufficient density, doing this will nicely show the boundaries of each Voronoi cell (as the boundary that separates two colors).
How to select the Voronoi centers? I use two orthogonal guidelines. After randomly selecting the w points, calculate the VT for your training data. Next, check the number of data points assigned to each Voronoi center; these values should be about the same (given uniform point density across your data space). In two dimensions, this would give a VT with tiles of the same size. That's the first rule; here's the second. Select w by iteration: run your kNN algorithm with w as a variable parameter, and measure performance (time required to return a prediction by querying the VT).
So imagine you have one million data points. If the points were persisted in an ordinary 2D data structure, or in a kd-tree that has degraded to brute-force search, you would perform about a million distance calculations for each new data point whose response variable you wish to predict. Of course, those calculations are performed on a single data set. With a VT, the nearest-neighbor search is performed in two steps, one after the other, against two different populations of data: first against the Voronoi centers, then, once the nearest center is found, the points inside the cell corresponding to that center are searched to find the actual nearest neighbor (by successive distance calculations). Combined, these two look-ups are much faster than a single brute-force look-up. That's easy to see: for 1M data points, suppose you select 250 Voronoi centers to tessellate your data space. On average, each Voronoi cell will have 4,000 data points. So instead of performing 1,000,000 distance calculations (brute force), you perform far fewer: 250 against the centers plus, on average, 4,000 inside the chosen cell.
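The two-step look-up can be sketched as follows. Function names, the random 2D data and the choice of 25 centers are assumptions for the example. One caveat worth stating: searching only the cell of the nearest center can miss the true nearest neighbor when it lies just across a cell boundary, so this is an approximate search unless neighboring cells are checked as well.

```python
import math
import random


def nearest(query, candidates):
    """Brute-force nearest candidate by Euclidean distance."""
    return min(candidates, key=lambda p: math.dist(query, p))


def build_cells(points, num_centers, rng):
    """Assign every point to its nearest randomly chosen Voronoi center."""
    centers = rng.sample(points, num_centers)
    cells = {center: [] for center in centers}
    for p in points:
        cells[nearest(p, centers)].append(p)
    return cells


def cell_search(query, cells):
    """Step 1: find the nearest center. Step 2: search only its cell."""
    center = nearest(query, list(cells))
    return nearest(query, cells[center])


# Hypothetical training data: 1,000 random points in the unit square.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
cells = build_cells(points, num_centers=25, rng=rng)

query = (0.5, 0.5)
found = cell_search(query, cells)
cell_sizes = sorted(len(members) for members in cells.values())
```

Inspecting `cell_sizes` is the first guideline in practice: the counts should be roughly equal if the centers cover the data evenly.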
III. Calculating the Result (the predicted response variable)
There are two steps to calculating the predicted value from a set of kNN training data. The first is identifying n, or the number of nearest neighbors to use for this calculation. The second is how to weight their contribution to the predicted value.
W/r/t the first component, you can determine the best value of n by solving an optimization problem (very similar to least squares optimization). That's the theory; in practice, most people just use n=3. In any event, it's simple to run your kNN algorithm over a set of test instances (to calculate predicted values) for n=1, n=2, n=3, etc. and plot the error as a function of n. If you just want a plausible value for n to get started, again, just use n = 3.
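Here is a rough sketch of that "try n = 1, 2, 3, ... and look at the error" loop. Everything in it is an assumption for illustration: the synthetic data, the 500/100 train/test split, the unweighted mean as the prediction, and RMSE as the error measure.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((600, 3))                                       # toy features
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(0, 0.1, 600)    # toy response
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

def knn_predict(x, n):
    """Unweighted kNN: mean response of the n nearest training points."""
    dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:n]
    return y_train[nearest].mean()

errors = {}
for n in range(1, 11):
    preds = np.array([knn_predict(x, n) for x in X_test])
    errors[n] = float(np.sqrt(np.mean((preds - y_test) ** 2)))  # RMSE on the test set
    print(n, round(errors[n], 4))

best_n = min(errors, key=errors.get)
print("lowest test error at n =", best_n)
```

Plotting errors against n (or just reading the printed table) shows where the curve flattens out; the best value depends on the data and the noise level.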
The second component is how to weight the contribution of each of the neighbors (assuming n > 1).
The simplest weighting technique is just multiplying each neighbor by a weighting coefficient, 1/(dist * K): the inverse of the distance from that neighbor to the test instance, often multiplied by some empirically derived constant, K. I am not a fan of this technique because it often over-weights the closest neighbors (and concomitantly under-weights the more distant ones); the significance of this is that a given prediction can be almost entirely dependent on a single neighbor, which in turn increases the algorithm's sensitivity to noise.
A much better weighting function, which substantially avoids this limitation, is the Gaussian function, which in Python looks like this:
import math

def weight_gauss(dist, sig=2.0):
    return math.exp(-dist**2 / (2 * sig**2))
To calculate a predicted value using your kNN code, you would identify the n nearest neighbors to the data point whose response variable you wish to predict (the 'test instance'), then call the weight_gauss function once for each of the n neighbors, passing in the distance between each neighbor and the test point. This function will return the weight for each neighbor, which is then used as that neighbor's coefficient in the weighted average calculation.
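Put together, the weighted-average step might look like the sketch below. The predict function, the (features, response) layout of train, and the toy numbers are my own assumptions; a brute-force sort stands in for whatever neighbor search you actually use.

```python
import math

def weight_gauss(dist, sig=2.0):
    return math.exp(-dist**2 / (2 * sig**2))

def predict(train, query, n=3, sig=2.0):
    """train: list of (feature_vector, response) pairs. Returns the weighted average."""
    # Brute-force neighbor search, kept simple for the illustration.
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:n]
    weights = [weight_gauss(d, sig) for d, _ in nearest]
    # If every neighbor is many multiples of sig away, the weights can
    # underflow to zero; in that case increase sig.
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

train = [((0.0, 0.0), 10.0), ((1.0, 0.0), 12.0), ((0.0, 2.0), 20.0), ((5.0, 5.0), 50.0)]
print(predict(train, (0.0, 0.0), n=3))
```

The sig parameter controls how quickly a neighbor's influence falls off with distance, so it should be chosen on the same scale as your typical neighbor distances.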

What you are facing is known as the curse of dimensionality. It is sometimes useful to run an algorithm like PCA or ICA to make sure that you really need all 21 dimensions, and possibly to find a linear transformation that would allow you to use fewer than 21 with approximately the same result quality.
Update:
I encountered them in a book called Biomedical Signal Processing by Rangayyan (I hope I remember it correctly). ICA is not a trivial technique, but it was developed by researchers in Finland and I think Matlab code for it is publicly available for download. PCA is a more widely used technique, and I believe you should be able to find an implementation in R or other software. PCA is computed from an eigendecomposition of the data's covariance matrix (equivalently, a singular value decomposition of the centered data). I did it too long ago to remember the details. = )
The idea is that you break up your signals into orthogonal eigenvectors (discrete eigenfunctions, really) and their eigenvalues, 21 in your case. Each eigenvalue measures how much of the variance in your measurements lies along its eigenfunction. If an eigenvalue is tiny, you can very closely represent the signals without using its corresponding eigenfunction at all, and that's how you get rid of a dimension.
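As a minimal sketch of that idea with NumPy: the data below is synthetic (21 dimensions that really vary along only 3 directions), and the 99% variance threshold is an arbitrary assumption, not a rule.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data: 500 samples in 21 dimensions driven by only 3 latent directions plus noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 21))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 21))

Xc = X - X.mean(axis=0)                    # center each dimension
cov = np.cov(Xc, rowvar=False)             # 21 x 21 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # re-sort: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.99) + 1)   # smallest k reaching the assumed 99% threshold
X_reduced = Xc @ eigvecs[:, :k]                 # project onto the top-k eigenvectors

print("components kept:", k, "reduced shape:", X_reduced.shape)
```

On real data the eigenvalues rarely drop off this sharply, so it is worth printing explained and picking the cut-off by eye.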

The top answers are good but old, so I'd like to add a 2016 answer.
As said, in a high-dimensional space the curse of dimensionality lurks around the corner, making traditional approaches, such as the popular k-d tree, as slow as a brute-force approach. As a result, we turn our interest to Approximate Nearest Neighbor Search (ANNS), which trades some accuracy for speed. You get a good approximation of the exact NN, with good probability.
Hot topics that might be worth a look:
Modern approaches of LSH, such as Razenshteyn's.
RKD forest: Forest(s) of Randomized k-d trees (RKD), as described in FLANN, or in a more recent approach I was part of, kd-GeRaF.
LOPQ, which stands for Locally Optimized Product Quantization, as described here. It is very similar to the new Babenko+Lempitsky approach.
You can also check my relevant answers:
Two sets of high dimensional points: Find the nearest neighbour in the other set
Comparison of the runtime of Nearest Neighbor queries on different data structures
PCL kd-tree implementation extremely slow

To answer your questions one by one:
No, Euclidean distance is a bad metric in high-dimensional space. In high dimensions, all data points tend to be far from one another, which shrinks the relative difference between the distances from a given data point to its nearest and its farthest neighbour.
There is a lot of research on high-dimensional data, but most of it requires a good deal of mathematical sophistication.
A KD tree is bad for high-dimensional data ... avoid it by all means.
Here is a nice paper to get you started in the right direction: "When Is 'Nearest Neighbor' Meaningful?" by Beyer et al.
I work with text data of dimensions 20K and above. If you want some text related advice, I might be able to help you out.

Cosine similarity is a common way to compare high-dimension vectors. Note that since it's a similarity not a distance, you'd want to maximize it not minimize it. You can also use a domain-specific way to compare the data, for example if your data was DNA sequences, you could use a sequence similarity that takes into account probabilities of mutations, etc.
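A small sketch of ranking by cosine similarity, in plain Python. The vectors and the names query and candidates are made up for the example; note the sort is descending because larger similarity means closer.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)   # undefined if either vector is all zeros

query = [1.0, 2.0, 0.0]
candidates = {"a": [2.0, 4.0, 0.0], "b": [0.0, 0.0, 3.0], "c": [1.0, 0.0, 0.0]}

# Most similar first: maximize similarity rather than minimize distance.
ranked = sorted(candidates, key=lambda k: cosine_similarity(query, candidates[k]), reverse=True)
print(ranked)   # ['a', 'c', 'b']
```

Candidate "a" points in exactly the same direction as the query (it is just twice as long), so it ranks first even though its Euclidean distance to the query is not the smallest.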
The number of nearest neighbors to use varies depending on the type of data, how much noise there is, etc. There are no general rules, you just have to find what works best for your specific data and problem by trying all values within a range. People have an intuitive understanding that the more data there is, the fewer neighbors you need. In a hypothetical situation where you have all possible data, you only need to look for the single nearest neighbor to classify.
The k Nearest Neighbor method is known to be computationally expensive. It's one of the main reasons people turn to other algorithms like support vector machines.

kd-trees indeed won't work very well on high-dimensional data, because the pruning step no longer helps a lot: the distance to the closest edge (a 1-dimensional deviation) will almost always be smaller than the full-dimensional distance to the known nearest neighbors.
Furthermore, kd-trees only work well with Lp norms as far as I know, and there is the distance concentration effect that makes distance-based algorithms degrade with increasing dimensionality.
For further information, you may want to read up on the curse of dimensionality, and the various variants of it (there is more than one side to it!)
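The distance concentration effect mentioned above is easy to see numerically. The sketch below uses uniform random data, which is an assumption on my part (real data with structure usually behaves less badly), and measures the relative contrast (farthest - nearest) / nearest for one random query.

```python
import numpy as np

rng = np.random.default_rng(3)
contrast = {}
for dim in (2, 21, 200, 2000):
    data = rng.random((1000, dim))      # uniform points in the unit cube
    query = rng.random(dim)
    dist = np.sqrt(((data - query) ** 2).sum(axis=1))
    # Relative contrast: how much farther the farthest point is than the nearest one.
    contrast[dim] = (dist.max() - dist.min()) / dist.min()
    print(dim, round(float(contrast[dim]), 3))
```

As the dimension grows the contrast shrinks towards zero, meaning the "nearest" neighbor is barely nearer than any other point.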
I'm not convinced there is a lot of use in just blindly approximating Euclidean nearest neighbors, e.g. using LSH or random projections. It may be necessary to use a much more fine-tuned distance function in the first place!

A lot depends on why you want to know the nearest neighbors. You might look into the mean shift algorithm http://en.wikipedia.org/wiki/Mean-shift if what you really want is to find the modes of your data set.

I think cosine on tf-idf of boolean features would work well for most problems. That's because it's a time-proven heuristic used in many search engines like Lucene. Euclidean distance, in my experience, shows bad results for any text-like data. Selecting different weights and k-examples can be done with training data and brute-force parameter selection.

iDistance is probably the best for exact kNN retrieval in high-dimensional data. You can view it as an approximate Voronoi tessellation.

I've experienced the same problem and can say the following.
Euclidean distance is a good distance metric; however, it's computationally more expensive than the Manhattan distance and sometimes yields slightly poorer results, so I'd choose the latter.
The value of k can be found empirically. You can try different values and check the resulting ROC curves or some other precision/recall measure in order to find an acceptable value.
Both Euclidean and Manhattan distances respect the Triangle inequality, thus you can use them in metric trees. Indeed, KD-trees have their performance severely degraded when the data have more than 10 dimensions (I've experienced that problem myself). I found VP-trees to be a better option.

KD Trees work fine for 21 dimensions, if you quit early,
after looking at say 5 % of all the points.
FLANN does this (and other speedups)
to match 128-dim SIFT vectors. (Unfortunately FLANN does only the Euclidean metric,
and the fast and solid
scipy.spatial.cKDTree
does only Lp metrics;
these may or may not be adequate for your data.)
There is of course a speed-accuracy tradeoff here.
(If you could describe your Ndata, Nquery, data distribution,
that might help people to try similar data.)
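For reference, here is a small sketch of querying scipy.spatial.cKDTree on 21-dimensional data. The data is uniform random and the sizes are toy values (my assumptions). It uses only the library's stock parameters: p selects the Lp metric and eps allows approximate answers. It does not reproduce the "quit after looking at x % of the points" cutoff described above, which is not one of the standard query arguments.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)
data = rng.random((20_000, 21))        # toy stand-in for the 21-dim data
queries = rng.random((100, 21))

tree = cKDTree(data, leafsize=10)

# Exact 2 nearest neighbours under the Euclidean (p=2) metric.
d_exact, i_exact = tree.query(queries, k=2, p=2)

# Approximate search: with eps=0.5 the k-th returned distance may be up to
# (1 + eps) times the true k-th nearest distance, which allows more pruning.
d_approx, i_approx = tree.query(queries, k=2, p=2, eps=0.5)

# Manhattan (p=1) neighbours from the same tree.
d_l1, i_l1 = tree.query(queries, k=2, p=1)

print("mean exact distance :", d_exact[:, 0].mean())
print("mean approx distance:", d_approx[:, 0].mean())
```

Timing the exact and approximate queries on your own data is the only reliable way to see whether the speed-accuracy tradeoff is worth it.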
Added 26 April, run times for cKDTree with cutoff on my old mac ppc, to give a very rough idea of feasibility:
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=1000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.1 % of the 1000000 points, 0.31 % of 188315 boxes; better 0.0042 0.014 0.1 %
3.5 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.253
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=5000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.48 % of the 1000000 points, 1.1 % of 188315 boxes; better 0.0071 0.026 0.5 %
15 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.245

You could try a z-order curve. It's easy for 3 dimensions.
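A minimal sketch of a 3-dimensional z-order (Morton) key, assuming the coordinates have already been quantized to non-negative integers; the function name morton3 and the 10-bit default are my own choices.

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of three non-negative integers into one key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)        # x takes bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)    # y takes bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)    # z takes bit positions 2, 5, 8, ...
    return code

points = [(5, 1, 0), (0, 0, 1), (1, 1, 1), (4, 0, 0), (1, 0, 0)]
ordered = sorted(points, key=lambda p: morton3(*p))
print(ordered)   # [(1, 0, 0), (0, 0, 1), (1, 1, 1), (4, 0, 0), (5, 1, 0)]
```

Points that are adjacent in the sorted order are usually close in space, but the converse does not always hold, so candidates found this way still need a real distance check.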

I had a similar question a while back. For fast Approximate Nearest Neighbor Search you can use the annoy library from Spotify: https://github.com/spotify/annoy
This is some example code for the Python API, which is optimized in C++.
from annoy import AnnoyIndex
import random
f = 40
t = AnnoyIndex(f, 'angular') # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)
t.build(10) # 10 trees
t.save('test.ann')
# ...
u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors
They provide different distance measures. Which one you want depends highly on your individual problem. Also consider prescaling (meaning weighting) certain dimensions by importance first. Those dimension or feature importance weights might be calculated by something like entropy loss or, if you have a supervised learning problem, Gini impurity gain or mean average loss, where you check how much worse your machine learning model performs if you scramble that dimension's values.
Often the direction of the vector is more important than its absolute value. For example, in the semantic analysis of text documents we want document vectors to be close when their semantics are similar, not when their lengths are. Thus we can either normalize those vectors to unit length or use angular distance (i.e. cosine similarity) as a distance measure.
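The two options are closely related: for unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so both give the same neighbor ranking. A quick numerical check (random 40-dimensional vectors, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = rng.normal(size=40), rng.normal(size=40)

a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
cos_sim = float(a_unit @ b_unit)
sq_euclid = float(((a_unit - b_unit) ** 2).sum())

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(sq_euclid, 2 - 2 * cos_sim)

# Rescaling a vector changes its length but not its direction,
# so the cosine similarity stays the same.
cos_scaled = float((10 * a) @ b / (np.linalg.norm(10 * a) * np.linalg.norm(b)))
print(cos_sim, cos_scaled)
```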
Hope this is helpful.

Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
I would suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are calculated to find the most relevant dimensions. You can use these weights when using Euclidean distance, for example. See curse of dimensionality for common problems; the following article may also be enlightening:
A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets
