knn(k nearest neighbor) density estimation source in matlab - nearest-neighbor

Is there any function or package that performs k-nearest-neighbor-based density estimation in MATLAB, or any open-source implementation?
I am not looking for kNN classification, only density estimation, please.

My guess is no. My search led me to this:
Classification Using Nearest Neighbors
where you can see how you can use NN search for classification and:
You can use kNN search for other machine learning algorithms, such as:
-> density estimation
On this link one can find some nice theory on the topics.
PS - I am not sure that the thing you are asking for does not exist, but if it does exist, it is very well hidden (which means that not many people use it), so it probably doesn't exist.
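In case a built-in turns out not to exist, the estimator itself is short enough to write by hand. Here is a minimal sketch in Python (not MATLAB), assuming the standard kNN density formula p(x) ≈ k / (n · V), where V is the volume of the smallest ball around x that contains k samples. The function name `knn_density` and the brute-force search are my own choices for illustration, not something taken from a toolbox.

```python
import math

def knn_density(query, samples, k):
    """Estimate the density at `query` as k / (n * V), where V is the volume
    of the smallest ball around `query` that contains k samples."""
    n = len(samples)
    d = len(query)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")
    # Distance from the query to every sample, sorted ascending (brute force).
    dists = sorted(math.dist(query, s) for s in samples)
    r_k = dists[k - 1]  # distance to the k-th nearest sample
    if r_k == 0.0:
        return math.inf  # degenerate case: k samples coincide with the query
    # Volume of the unit ball in d dimensions, scaled by r_k ** d below.
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return k / (n * unit_ball * r_k ** d)
```

For more than a few thousand samples you would probably replace the sorted brute-force search with a kNN search structure, which is exactly the "kNN search for density estimation" use mentioned in the quoted documentation.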

Related

3D Nearest neighbour for query points located far away from set of points

I need to answer a lot of queries about finding the nearest neighbour in a point set that is located far away from the query point. All approaches I've found so far work badly in this case (for example, a k-d tree may take O(N) per query) or require a Voronoi diagram (I have ~10m points, so a Voronoi diagram is too expensive).
Is there any known algorithm designed for such a task?
The problem here is the distances. You see, when a query is far from your dataset, the kd-tree has to check many points, which slows down the query.
The scenario you are facing is hard for nearest-neighbor structures in general (and it's not the usual case), but if I were you, I would give Balanced Box-Decomposition trees a try, where you can read more about their algorithm and data structure.
Some multidimensional indexes have kNN queries that could easily be adapted to your needs, especially with k==1.
kNN algorithms usually have to first estimate the approximate nearest neighbour distance, then they use this distance to perform a range query.
In R-Trees or quadtrees, this estimation can be done efficiently by finding the node that is closest to your search point. Then they take one point from the closest node, calculate the distance to the search point, and then perform a range query based on this distance, usually with some multiplier because k>1.
This should be reasonably efficient even if the search point is far away.
If you are searching for one point only (k=1) then you could adapt this algorithm to use a range query that is exactly based on the closest point you found, no extra extension to get k>1 points.
If you are using Java, you could use my open-source implementations here. There is also a PH-Tree (a kind of quadtree, but much more space efficient and faster to load), which uses the same kNN approach.
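To make the estimate-then-range-query idea above concrete, here is a small sketch in Python. It is not the Java implementation mentioned above: a uniform grid of cells stands in for the tree nodes, and `build_grid`, `box_distance` and `nearest` are hypothetical names made up for this illustration. It scans all cells linearly, whereas a real R-Tree or quadtree would descend the hierarchy instead.

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Bucket points into cubic cells of side `cell` (a stand-in for tree nodes)."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(c / cell) for c in p)].append(p)
    return grid

def box_distance(query, key, cell):
    """Distance from `query` to the axis-aligned box of grid cell `key`."""
    total = 0.0
    for q, k in zip(query, key):
        lo, hi = k * cell, (k + 1) * cell
        gap = max(lo - q, 0.0, q - hi)
        total += gap * gap
    return math.sqrt(total)

def nearest(query, grid, cell):
    # Step 1: find the node closest to the query and take one point from it.
    closest_key = min(grid, key=lambda k: box_distance(query, k, cell))
    best = grid[closest_key][0]
    radius = math.dist(query, best)
    # Step 2: range query with that radius (k = 1, so no multiplier),
    # shrinking the radius whenever a closer point turns up.
    for key, bucket in grid.items():
        if box_distance(query, key, cell) > radius:
            continue  # nothing in this node can be closer than the current best
        for p in bucket:
            d = math.dist(query, p)
            if d < radius:
                best, radius = p, d
    return best
```

The point of the sketch is the pruning test: once one candidate distance is known, any node whose box is farther away than that distance can be skipped without looking at its points, even when the query is far from the data.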

Demons algorithm for image registration (for dummies)

I was trying to make an application that compares the difference between 2 images in Java with OpenCV. After trying various approaches I came across an algorithm called the Demons algorithm.
To me it seems to give the difference of images by some transformation on each place. But I couldn't understand it since the references I found were too complex for me.
Even if the demons algorithm does not do what I need, I'm interested in learning it.
Can anyone explain simply what happens in the demons algorithm and how to write simple code that uses it on 2 images?
I can give you an overview of general algorithms for deformable image registration; demons is one of them.
There are 3 components of the algorithm: a similarity metric, a transformation model and an optimization algorithm.
A similarity metric is used to compute pixel-based or patch-based similarity between pixels or patches. Common similarity measures are SSD (sum of squared differences) and normalized cross-correlation for mono-modal images, while information-theoretic measures like mutual information are used in the case of multi-modal image registration.
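As a rough illustration of the two mono-modal measures named above, here is a sketch in Python. It assumes each patch has already been flattened into a list of intensity values of equal length; the function names `ssd` and `ncc` are my own, and returning 0 for a constant patch is a convention picked for this example.

```python
import math

def ssd(a, b):
    """Sum of squared differences between two equally sized patches
    (lower means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ncc(a, b):
    """Normalized cross-correlation in [-1, 1] (higher means more similar)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a constant patch has no defined correlation; 0 is a convention here
    return sum(x * y for x, y in zip(da, db)) / denom
```

Note the practical difference: SSD changes when one image is simply brighter than the other, while normalized cross-correlation does not, because it subtracts the mean and divides by the spread of each patch.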
In the case of deformable registration, they generally have a regular grid superimposed over the image, and the grid is deformed by solving an optimization problem formulated so that the similarity cost and the smoothness penalty imposed on the transformation are minimized. In deformable registration, once there are deformations over the grid, the final transformation at the pixel level is computed using a B-Spline interpolation of the grid, so that the transformation is smooth and continuous.
There are 2 general approaches to solving the optimization problem: some people use discrete optimization and solve it as an MRF optimization problem, while others use gradient descent. I think demons uses gradient descent.
In the case of MRF-based approaches, the unary cost is the cost of deforming each node in the grid, and it is the similarity computed between patches. The pairwise cost, which imposes the smoothness of the grid, is generally a Potts or truncated quadratic potential, which ensures that neighboring nodes in the grid have almost the same displacement. Once you have the unary and pairwise costs, you feed them to an MRF optimization algorithm and get the displacements at the grid level, then you use a B-Spline interpolation to compute the pixel-level displacement. This process is repeated in a coarse-to-fine fashion over several scales, and the algorithm is also run many times at each scale (reducing the displacement at each node every time).
In the case of gradient-descent-based methods, they formulate the problem with the similarity metric and the grid transformation computed over the image, and then compute the gradient of the energy function they have formulated. The energy function is minimized using iterative gradient descent; however, these approaches can get stuck in a local minimum and are quite slow.
Some popular methods are DROP and Elastix; ITK also provides some tools.
If you want to know more about algorithms related to deformable image registration, I recommend taking a look at FAIR (guide book). FAIR is a toolbox for MATLAB, so you will have examples to help you understand the theory.
http://www.cas.mcmaster.ca/~modersit/FAIR/
Then, if you want to see a demons example specifically, there is this other toolbox:
http://www.mathworks.es/matlabcentral/fileexchange/21451-multimodality-non-rigid-demon-algorithm-image-registration

What data do I need to implement k nearest neighbor?

I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.
It seems like K nearest neighbor or k means are the best way to do this.
I can't seem to understand how to actually implement this. I've seen some mathematical formulas (such as the one on the k means wikipedia page), but they don't really make sense to me.
Could someone maybe recommend some pseudo code, or places to look so I can get a better feel on how to do this?
K-Nearest Neighbor (aka KNN) is a classification algorithm.
Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this may be category of posts, who posted the item, who upvoted the item, etc.
Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown by locating the nearest neighbors to it in the classification system. If you determine the classification by the 3 nearest neighbors, it could then be called a 3-nearest-neighbor algorithm.
How you determine the 'nearest neighbor' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space where N represents the number of different classification characteristics you are examining.
A trivial example:
Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.
If I asked you which country a random longitude/latitude point is in, would you be able to figure it out? What would you do to figure it out?
Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.
The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.
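The circle-growing procedure from the city example can be sketched in a few lines of Python. This is only an illustration under some assumptions: the points are treated as plain X,Y coordinates with straight-line distance (as in the example, not great-circle distance), neighbours are found by sorting everything (fine for small data), and `knn_classify` is a name I made up.

```python
import math
from collections import Counter

def knn_classify(query, labelled_points, k):
    """Return (label, vote share) decided by the k nearest labelled points.
    `labelled_points` is a list of ((x, y), label) pairs."""
    nearest = sorted(labelled_points,
                     key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)
```

The vote share plays the role of the "degree of certainty" above: 10 out of 10 neighbours in one country gives 1.0, while 6 out of 10 gives 0.6. For the recommendation use case, the points would be whatever features you decide describe a post, and the labels whatever you want to predict.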
What you described sounds like a recommender system engine, not a clustering algorithm like k-means, which is in essence an unsupervised approach. I don't have a clear idea of what reddit actually uses, but I found some interesting posts by googling "recommender + reddit", e.g. Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! Anyway, the k-NN algorithm (described in the top ten data mining algorithms, with pseudo-code on Wikipedia) might be used, or other techniques like collaborative filtering (used by Amazon, for example), described in this good tutorial.
k-means clustering in its simplest form averages values and keeps nearby values grouped around one central average value. Suppose you have the following values:
1,2,3,4,6,7,8,9,10,11,12,21,22,33,40
Now suppose I run k-means clustering on them. Remember that k-means clustering is driven by means (averages): each value is placed with the center it is close to and kept apart from the centers it is far away from. We might get the following:
cluster-1
1,2,3,4,6,7,8
cluster-2
9,10,11,12
cluster-3
21,22
cluster-4
33
cluster-5
40
Remember I just made up these cluster centers (cluster 1-5).
So the next time you do clustering, the numbers would end up around one of these central means (also known as k-centers). The data above is one-dimensional.
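For comparison with the hand-made clusters above, this is roughly what the standard iteration (Lloyd's algorithm: assign each value to the nearest center, then move each center to the mean of its values) looks like for one-dimensional data in Python. The function name `kmeans_1d` is my own, and the starting centers have to be supplied by the caller, so treat this as a sketch rather than a complete implementation.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: the assignments can no longer change
        centers = new_centers
    return centers, clusters
```

Calling it with the values listed above and five starting centers of your choice returns the final centers and the values grouped around them. The result depends on the starting centers, so it will not necessarily match the clusters made up above.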
When you perform k-means clustering on large, multidimensional data sets (a multidimensional data point is an array of values, and you will have millions of them, all of the same dimension), you will need something bigger and more scalable. You first average one array to get a single value, likewise repeat the same for the other arrays, and then perform the k-means clustering.
Read one of my questions Here
Hope this helps.
To do k-nearest neighbors you mostly need a notion of distance and a way of finding the k nearest neighbours to a point that you can afford (you probably don't want to search through all your data points one by one). There is a library for approximate nearest neighbour at http://www.cs.umd.edu/~mount/ANN/. It's a very simple classification algorithm - to classify a new point p, find its k nearest neighbours and classify p according to the most popular classes amongst those k neighbours.
I guess in your case you could provide somebody with a list of similar posts as soon as you decide what nearest means, and then monitor click-through from this and try to learn from that to predict which of those alternatives would be most popular.
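As one possible way of deciding "what nearest means" for posts, here is a sketch in Python that compares posts by the sets of users who liked them, using the Jaccard distance. This choice of distance, the data layout (`likes_by_post` mapping a post id to a set of user ids) and the function names are all assumptions made for illustration; any other notion of distance could be plugged in the same way.

```python
def jaccard_distance(a, b):
    """1 minus (size of intersection / size of union) for two sets;
    0 means identical, 1 means nothing in common."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical here
    return 1.0 - len(a & b) / len(union)

def similar_posts(post_id, likes_by_post, k):
    """Return the k posts whose sets of likers are closest to those of `post_id`."""
    target = likes_by_post[post_id]
    others = [p for p in likes_by_post if p != post_id]
    others.sort(key=lambda p: jaccard_distance(target, likes_by_post[p]))
    return others[:k]
```

This scans every post on each call, which is the one-by-one search warned about above, so for a large site you would want an approximate nearest neighbour index instead.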
If you are interested in finding a particularly good learning algorithm for your purposes, have a look at http://www.cs.waikato.ac.nz/ml/weka/ - it allows you to try out a large number of different algorithms, and also to write your own as plug-ins.
Here is a very simple example of KNN for the MNIST dataset.
Once you are able to calculate the distance between your documents, the same algorithm will work:
http://shyamalapriya.github.io/digit-recognition-using-k-nearest-neighbors/

Explain 0-extension algorithm

I'm trying to implement the 0-extension algorithm.
It is used to colour a graph with a number of colours where some nodes already have a colour assigned and where every edge has a distance. The algorithm calculates an assignment of colours so that neighbouring nodes with the same colour have as much distance between them as possible.
I found this paper explaining the algorithm: http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=1FBA2D22588CABDAA8ECF73B41BD3D72?doi=10.1.1.100.8049&rep=rep1&type=pdf
but I don't see how I need to implement it.
I already asked this question on the "theoretical computer science" site, but halfway through the discussion we went beyond the site's scope:
https://cstheory.stackexchange.com/questions/6163/explain-0-extension-algorithm
Can anyone explain this algorithm in layman's terms?
I'm planning to make the final code open source in the JGraphT package.
The objective of 0-extension is to minimize the total weighted cost of edges with different color endpoints rather than to maximize it, so 0-extension is really a clustering problem rather than a coloring problem. I'm generally skeptical that using a clustering algorithm to color would have good results. If you want something with a theoretical guarantee, you could look into approximations to the MAXCUT problem (really a generalization if there are more than two colors), but I suspect that a local-search algorithm would work better in practice.
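To make the objective described above concrete, here is a small sketch in Python. It follows the description in this answer only: the cost is the total weight of edges whose endpoints have different colours, which treats every pair of colours as equally far apart (an assumption on my part, not taken from the paper). The brute-force search and the names `cut_cost` and `brute_force_extension` are hypothetical and only meant for checking tiny instances, not as the algorithm from the paper.

```python
from itertools import product

def cut_cost(edges, colour):
    """Total weight of the edges whose endpoints received different colours.
    `edges` is a list of (u, v, weight) triples."""
    return sum(w for u, v, w in edges if colour[u] != colour[v])

def brute_force_extension(nodes, edges, fixed, colours):
    """Try every colouring of the free nodes and keep the cheapest one.
    Exponential in the number of free nodes, so only usable on tiny graphs."""
    free = [n for n in nodes if n not in fixed]
    best, best_cost = None, float("inf")
    for choice in product(colours, repeat=len(free)):
        colour = dict(fixed)            # pre-coloured nodes keep their colour
        colour.update(zip(free, choice))
        cost = cut_cost(edges, colour)
        if cost < best_cost:
            best, best_cost = colour, cost
    return best, best_cost
```

Running this on a small graph shows why the result behaves like clustering: heavy edges pull their endpoints into the same colour, which is the opposite of what a colouring that separates same-coloured neighbours would want.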

Algorithm on trajectory analysis

I would like to analyse trajectory data based on given templates.
I need to stack similar trajectories together.
The data is a set of coordinates (xy, xy, xy) and the templates are again lines defined by the set of control points.
I don't know which direction to go in, maybe neural networks or pattern recognition?
Could you please recommend a page, book or library to start with?
Kind regards,
Arman.
PS:
Is it the right place to ask the question?
EDIT
To be more precise the trajectory contains about 50-100 control points.
Here you can see the example of trajectories:
http://www.youtube.com/watch?v=KFE0JLx6L-o
Your question is quite vague.
You can use regression analysis (http://en.wikipedia.org/wiki/Regression_analysis) to find the relationship between x and y on a set of coordinates, and then compare that with the other trajectories.
Are there always four coordinates per trajectory? You might want to calculate the Euclidean distance between the first coordinates of all trajectories, and then the same for the second and so on.
You might want to normalize the distance and analyze the change in direction instead. It all comes down to what you really need.
If you need to stack similar trajectories together you might be interested in the k-nearest neighbour algorithm (http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm). As for the dimensions to use for that algorithm, you might use your xy coordinates or any quantities derived from them.
You can use a clustering algorithm to 'stack the similar trajectories together'. I have used spectral clustering on trajectories with good results. Depending on your application, hierarchical clustering may be more appropriate.
A critical part of your analysis will be the distance measure between trajectories. State of the art is dynamic time warping. I've also seen good results achieved with a modified Hausdorff measure.
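For reference, the textbook form of dynamic time warping is short. Here is a minimal sketch in Python, assuming each trajectory is a list of (x, y) points and using the Euclidean distance between points as the local cost; the function name `dtw_distance` is my own, and there is no windowing or normalization, which a real application might want.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two trajectories,
    each a sequence of (x, y) points, with Euclidean point cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances alone
                                    cost[i][j - 1],      # b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because points may be matched one-to-many, two trajectories that follow the same path but are sampled with a different number of control points still come out close, which is what makes this useful for comparing trajectories against templates. The resulting pairwise distances can then be fed to the clustering step.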

Resources