Similarity for arrays of parts of speech - algorithm

K-nearest neighbor and natural language processing: how do you measure the distance between arrays of parts of speech, e.g. ('verb', 'adverb', 'noun') and ('adjective', 'adverb', 'pronoun')?
A better-phrased question would be: how do you tell the similarity between the two, given that they are parts of speech and not just strings?

As a general approach, you can use the cosine between POS vectors as a measure of their similarity. An alternative approach would be to use the Hamming distance between the two vectors.
There are plenty of other distance functions between vectors, but it really depends on what you want to do and what your data looks like. You should answer questions like: does position matter? How similar would you consider ('noun', 'verb') and ('verb', 'noun')? Is the distance between ('adverb') and ('adjective') smaller than the distance between ('adverb') and ('noun')? And so on.
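To make the two suggestions concrete, here is a small illustrative sketch in plain Python: Hamming distance on the raw POS sequences (position-sensitive) and cosine similarity on bag-of-POS count vectors (position-insensitive). The tag inventory and helper names are assumptions for the example, not a fixed standard.

```python
from collections import Counter
import math

TAGS = ['noun', 'verb', 'adjective', 'adverb', 'pronoun']  # illustrative tag set

def hamming(a, b):
    """Position-sensitive: count positions where the tags differ."""
    return sum(x != y for x, y in zip(a, b))

def cosine(a, b):
    """Position-insensitive: compare POS frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    va = [ca[t] for t in TAGS]
    vb = [cb[t] for t in TAGS]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return dot / norm if norm else 0.0

print(hamming(('verb', 'adverb', 'noun'), ('adjective', 'adverb', 'pronoun')))  # 2
print(cosine(('verb', 'adverb', 'noun'), ('adjective', 'adverb', 'pronoun')))   # ~0.33
```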

Related

Algorithm: How to smoothly interpolate/reconstruct sparse samples with noise?

This question is not directly related to a particular programming language but is an algorithmic question.
What I have is a lot of samples of a 2D function. The samples are at random locations, they are not uniformly distributed over the domain, the sample values contain noise, and each sample has a confidence weight assigned to it.
What I'm looking for is an algorithm to reconstruct the original 2D function based on the samples, i.e. a function y' = G(x0, x1) that approximates the original well and smoothly interpolates areas where samples are sparse.
It goes in the direction of what scipy.interpolate.griddata does, but with the added difficulty that:
the sample values contain noise, meaning that samples should not just be interpolated; nearby samples should also be averaged in some way to smooth out the sampling noise.
the samples are weighted, so samples with higher weight should contribute more strongly to the reconstruction than those with lower weight.
scipy.interpolate.griddata seems to do a Delaunay triangulation and then use the barycentric coordinates of the triangles to interpolate values. This doesn't seem to be compatible with my requirements of weighting samples and averaging out noise, though.
Can someone point me in the right direction on how to solve this?
Based on the comments, the function is defined on a sphere. That simplifies life because your region is both well-studied and nicely bounded!
First, decide how many Spherical Harmonic functions you will use in your approximation. The fewer you use, the more you smooth out noise. The more you use, the more accurate it will be. But if you use any of a particular degree, you should use all of them.
And now you just impose the condition that the sum of the squares of the weighted errors should be minimized. That will lead to a system of linear equations, which you then solve to get the coefficients of each harmonic function.
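As a rough sketch of that procedure, assuming the samples come as (theta, phi) angles on the sphere with values y and weights w, one can build a design matrix from scipy's spherical harmonics and solve the weighted least-squares problem min Σᵢ wᵢ (G(θᵢ, φᵢ) − yᵢ)² with numpy. The function names, the degree cutoff, and the synthetic data below are illustrative choices, not part of any library.

```python
import numpy as np
from scipy.special import sph_harm

def fit_spherical_harmonics(theta, phi, y, w, l_max=4):
    """Weighted least-squares fit of real spherical harmonics up to degree l_max."""
    def basis(theta, phi):
        cols = []
        for l in range(l_max + 1):
            for m in range(l + 1):
                ylm = sph_harm(m, l, theta, phi)   # scipy: theta = azimuth, phi = polar
                cols.append(ylm.real)
                if m > 0:
                    cols.append(ylm.imag)          # Re/Im parts together span the real basis
        return np.column_stack(cols)

    A = basis(theta, phi)
    sw = np.sqrt(w)
    # Minimizing sum_i w_i * (A @ c - y)_i^2 is ordinary lstsq on sqrt(w)-scaled rows.
    coeffs, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return lambda th, ph: basis(th, ph) @ coeffs   # the smooth reconstruction G

# Synthetic usage: noisy, weighted samples of a smooth function on the sphere.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 500)
phi = np.arccos(rng.uniform(-1.0, 1.0, 500))
y = np.sin(phi) * np.cos(theta) + rng.normal(0.0, 0.1, 500)
w = rng.uniform(0.5, 1.0, 500)
G = fit_spherical_harmonics(theta, phi, y, w, l_max=3)
```

The fewer harmonics you keep (smaller l_max), the stronger the smoothing; the weights enter only through the row scaling.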

What distances measures perform well on content-based Recommendation Systems?

I want to implement a content-based recommendation system that provides a list of recommended books based on user input.
I'll be using TF-IDF to determine how important a word is to a given book and will create a Book Characteristic Vector for every book.
I need to create a similarity matrix to determine possible pairs of books. I came across Euclidean distance for doing that. Are there any other methods better than Euclidean?
These are some good distance measures you might try:
(generalized) Jaccard distance
Manhattan distance
Hellinger distance
cosine similarity
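For example, a quick sketch with scikit-learn showing cosine similarity and Manhattan distance on TF-IDF vectors (Jaccard and Hellinger would be computed analogously); the book texts are placeholders for real descriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

books = [
    "a wizard attends a school of magic",
    "a detective solves crimes in victorian london",
    "a young wizard battles a dark lord",
]

tfidf = TfidfVectorizer().fit_transform(books)             # Book Characteristic Vectors

cos_sim = cosine_similarity(tfidf)                         # higher = more similar
manhattan = pairwise_distances(tfidf, metric="manhattan")  # lower = more similar
print(cos_sim.round(2))
print(manhattan.round(2))
```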

Good algorithm for finding subsets of point sets

I'm trying to find suitable algorithms for searching for subsets of 2D points in a larger set.
A picture is worth thousand words, so:
Any ideas on how one could achieve this? Note that the transformations are just rotation and scaling.
It seems that the most closely related problem is point set registration [1].
I was experimenting with CPD and other rigid and non-rigid registration implementations, but they don't seem to perform too well on finding small subsets in larger sets of points.
Another approach could be using star-tracking algorithms like the Angle method mentioned in [2], or more robust methods like [3]. But again, they all seem to be meant for large input sets and target sets. I'm looking for something less reliable but more minimalistic...
Thanks for any ideas!
[1]: http://en.wikipedia.org/wiki/Point_set_registration
[2]: http://www.acsu.buffalo.edu/~johnc/star_gnc04.pdf
[3]: http://arxiv.org/abs/0910.2233
Here are some papers probably related to your question:
Geometric Pattern Matching under Euclidean Motion (1993) by L. Paul Chew , Michael T. Goodrich , Daniel P. Huttenlocher , Klara Kedem , Jon M. Kleinberg , Dina Kravets.
A fast expected time algorithm for the 2-D point pattern (2004) by Wamelena, Iyengarb.
Simple algorithms for partial point set pattern matching under rigid motion (2006) by Bishnua, Dasb, Nandyb, Bhattacharyab.
Exact and approximate Geometric Pattern Matching for point sets in the plane under similarity transformations (2007) by Aiger and Kedem.
and by the way, your last reference reminded me of:
An Application of Point Pattern Matching in Astronautics (1994) by G. Weber, L. Knipping and H. Alt.
I think you should start with a subset of the input points and determine the required transformation to match a subset of the large set. For example:
choose any two points of the input, say A and B.
map A and B to a pair of points in the large set. This will determine the scale and two possible rotation angles (clockwise or counterclockwise).
apply the same scaling and rotation to a third input point C and check the large set to see if a point exists there. You'll have to check two positions, one for each rotation angle. If the point C exists where it should be in the large set, you can check the rest of the points.
repeat for each pair of points in the large set
I think you could also try to match a subset of 3 input points, knowing that the angles of a triangle will be invariant under scaling and rotations.
Those are my ideas, I hope they help solve your problem.
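A minimal sketch of this pair-matching idea, treating the 2D points as complex numbers so that one uniform scale plus rotation (plus the translation the pair correspondence also fixes) is a single complex multiplication. The tolerance and the brute force over all ordered pairs are illustrative assumptions.

```python
import numpy as np
from itertools import permutations
from scipy.spatial import cKDTree

def find_subset(pattern, cloud, tol=1e-3):
    """Try to locate `pattern` (small (n,2) array) inside `cloud` (large (m,2) array)."""
    P = pattern[:, 0] + 1j * pattern[:, 1]       # pattern points as complex numbers
    C = cloud[:, 0] + 1j * cloud[:, 1]
    tree = cKDTree(cloud)
    A, B = P[0], P[1]                            # the two reference points of step 1
    for i, j in permutations(range(len(C)), 2):  # step 4: try every ordered pair
        s = (C[j] - C[i]) / (B - A)              # complex factor = scale * rotation
        t = C[i] - s * A                         # translation fixed by the same pair
        mapped = s * P + t                       # map the whole pattern (steps 2-3)
        d, _ = tree.query(np.column_stack([mapped.real, mapped.imag]))
        if np.all(d < tol * abs(s)):             # every mapped point found a partner
            return s, t                          # similarity transform that matches
    return None
```

For large clouds you would prune candidate pairs first, e.g. by comparing segment-length ratios, before doing the full check.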
I would try the Iterative Closest Point algorithm. A simple version like the one you need should be easy to implement.
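A bare-bones sketch of rigid 2D ICP along those lines (no scale, for brevity), using only numpy and scipy; the iteration cap and convergence threshold are arbitrary choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(source, target, iterations=50, tol=1e-6):
    """Align `source` (n,2) to `target` (m,2); returns the transformed source."""
    src = source.copy()
    tree = cKDTree(target)
    prev_err = np.inf
    for _ in range(iterations):
        dist, idx = tree.query(src)             # nearest target point per source point
        matched = target[idx]
        mu_s, mu_t = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_t)   # cross-covariance of centred sets
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T                          # optimal rotation (Kabsch)
        if np.linalg.det(R) < 0:                # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        src = (src - mu_s) @ R.T + mu_t         # apply rotation + translation
        err = dist.mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return src
```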
Take a look at geometric hashing. It allows finding geometric patterns under different transformations. If you use only rotation and scale, it will be quite simple.
The main idea is to encode the pattern in "native" coordinates, which are invariant under the transformations.
You can try a geohash: translate the points to binary and interleave the bits. Measure the distance and compare it with the original. You can also try to rotate the geohash, i.e. the z-curve or Morton curve.
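For illustration, a tiny sketch of the bit interleaving behind a Morton/z-curve code; the 16-bit resolution is an arbitrary assumption.

```python
def morton_code(x, y, bits=16):
    """Interleave the bits of two quantized coordinates into one z-curve code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # even bit positions from x
        code |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions from y
    return code

print(bin(morton_code(0b1010, 0b0110)))
```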

Measuring distance between vectors

I have a set of 300,000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest one. I have thought of a few methods:
Simple Euclidean distance
Cosine similarity
Use a kernel (for instance Gaussian) to calculate the Gram matrix.
Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.
I do not really understand when it is useful to do one rather than the other. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is the best?
Sorry for the weak question, but I had to start somewhere...
Thank you!
Your question is not quite clear: are you looking for a distance metric between vectors, or an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as a kd-tree, since you are just looking at points in d-dimensional space. See http://en.wikipedia.org/wiki/Nearest_neighbor_search for other methods.
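A minimal sketch of the kd-tree suggestion using scipy; the random data and the dimensionality stand in for the real 300,000 vectors.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
vectors = rng.random((300_000, 16))     # the stored vectors
query = rng.random(16)                  # the vector to look up

tree = cKDTree(vectors)                 # build once, query many times
dist, idx = tree.query(query, k=1)      # Euclidean nearest neighbour
print(idx, dist)
```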
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
If your vectors are very sparse in nature and binary, you can use Hamming or Hellinger distance. When your vector dimensions are large, avoid using Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits itself to pairs of probability distributions.

Algorithm on trajectory analysis

I would like to analyse trajectory data based on given templates.
I need to stack similar trajectories together.
The data is a set of coordinates (xy, xy, xy) and the templates are again lines defined by the set of control points.
I don't know which direction to go in, maybe neural networks or pattern recognition?
Could you please recommend a page, book or library to start with?
Kind regards,
Arman.
PS:
Is it the right place to ask the question?
EDIT
To be more precise the trajectory contains about 50-100 control points.
Here you can see the example of trajectories:
http://www.youtube.com/watch?v=KFE0JLx6L-o
Your question is quite vague.
You can use regression analysis (http://en.wikipedia.org/wiki/Regression_analysis) to find the relationship between x and y on a set of coordinates, and then compare that with the other trajectories.
Are there always four coordinates per trajectory? You might want to calculate the Euclidean distance between the first coordinates of all trajectories, and then the same for the second and so on.
You might want to normalize the distance and analyze the change in direction instead. It all comes down to what you really need.
If you need to stack similar trajectories together you might be interested in the k-nearest neighbour algorithm (http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm). As for the dimensions to use for that algorithm, you might use your xy coordinates or any features derived from them.
You can use a clustering algorithm to 'stack the similar trajectories together'. I have used spectral clustering on trajectories with good results. Depending on your application, hierarchical clustering may be more appropriate.
A critical part of your analysis will be the distance measure between trajectories. State of the art is dynamic time warping. I've also seen good results achieved with a modified Hausdorff measure.
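A compact sketch of dynamic time warping between two 2D trajectories, which could serve as the distance fed into spectral or hierarchical clustering; the trajectories below are synthetic placeholders.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between trajectories a (n,2) and b (m,2)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # point-to-point cost
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m]

t1 = np.column_stack([np.linspace(0, 1, 60), np.sin(np.linspace(0, 3, 60))])
t2 = np.column_stack([np.linspace(0, 1, 80), np.sin(np.linspace(0, 3, 80))])
print(dtw_distance(t1, t2))
```

DTW handles trajectories of different lengths and speeds, which is why it is the usual choice over a point-by-point Euclidean comparison.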

Resources