What distance measures perform well in content-based recommendation systems? - hadoop

I want to implement a content based recommendation system that provides a list of recommended books based on user input.
I'll be using TF-IDF to determine how important a word is to a given book, and will create a Book Characteristic Vector for every book.
I need to create a similarity matrix to determine possible pairs of similar books. I came across Euclidean distance for doing that. Are there any other methods better than Euclidean?

These are some good distance measures you might try:
(generalized) Jaccard distance
Manhattan distance
Hellinger distance
cosine similarity
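As a rough illustration of two of the measures listed above, here is a minimal sketch that builds TF-IDF vectors and compares them with cosine similarity and generalized Jaccard distance. The function names, the plain `tf * log(N / df)` weighting, and the toy documents are assumptions made for this example, not part of the original answer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def generalized_jaccard_distance(a, b):
    """1 - sum(min) / sum(max); assumes non-negative weights."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return 1.0 - num / den if den > 0 else 0.0

# Hypothetical toy "books", already tokenized.
books = [
    ["wizard", "school", "magic"],
    ["wizard", "magic", "dragon"],
    ["stock", "market", "finance"],
]
vecs = tfidf_vectors(books)
sim_fantasy = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
```

Cosine similarity ignores vector length, which is often convenient for TF-IDF vectors because long and short books then remain comparable.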

Related

What is the best algorithm for determining duplicated paths between tracking trips?

I am developing a mobile application for recording user trips. A trip is made up of a sequence of user positions (with longitude and latitude values).
Now my problem is how to determine whether a trip has already been traveled before. In other words, how do I detect duplicated paths between trips?
(I know that we cannot have two trips with exactly the same data points, so I don't know where to begin. I am looking for an algorithm that could approximately address this problem.)
Thanks for your help!
There are a couple of trajectory distance measures that could help: Euclidean Distance, Dynamic Time Warping (DTW), Edit Distance with Real Penalty (ERP), Longest Common Subsequence (LCSS), ... Which one to pick depends on how you want to define similarity.
In this paper the authors describe all distance measures and evaluate them.
As far as I understand your scenario, an LCSS- or ERP-based similarity measure might fit. A quick search brought me to this GitHub repository.
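To make the LCSS idea concrete, here is a minimal sketch, not taken from the paper or repository mentioned above. It assumes trips are lists of `(latitude, longitude)` tuples, that two positions "match" when they lie within a hypothetical threshold `eps_m` (in meters), and it normalizes by the shorter trip; all names are made up for this example.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def lcss_similarity(trip_a, trip_b, eps_m=30.0):
    """Longest common subsequence of matching points, scaled to [0, 1]."""
    n, m = len(trip_a), len(trip_b)
    if n == 0 or m == 0:
        return 0.0
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            if haversine_m(trip_a[i - 1], trip_b[j - 1]) <= eps_m:
                # Points are close enough: extend the common subsequence.
                curr[j] = prev[j - 1] + 1
            else:
                # Otherwise skip a point in one of the two trips.
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[m] / min(n, m)

# Hypothetical trips: trip_b follows trip_a a few meters off, trip_c is elsewhere.
trip_a = [(52.5200, 13.4050), (52.5205, 13.4060), (52.5210, 13.4070)]
trip_b = [(52.52005, 13.4050), (52.52055, 13.4060), (52.52105, 13.4070)]
trip_c = [(48.8566, 2.3522), (48.8570, 2.3530)]
same_route = lcss_similarity(trip_a, trip_b)
different_route = lcss_similarity(trip_a, trip_c)
```

A score near 1 suggests the shorter trip is almost entirely contained in the other; where to put the "duplicate" cutoff is an application-level decision.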

A clustering algorithm that accepts an arbitrary distance function

I have about 200 points in the Cartesian plane (2D). I want to cluster these points into k clusters with respect to an arbitrary distance function (not a matrix) and get the so-called centroids or representatives of these clusters. I know k-means does this for some special distance functions such as Euclidean, Manhattan, cosine, etc. But k-means cannot handle an arbitrary distance function. For example, in the centroid-updating phase of k-means with Euclidean distance, the mean of the points in each cluster is the least-squares estimate and minimizes the sum of squared distances from the points in the cluster to its centroid (the mean); however, the mean of the points may not minimize the distances when the distance function is something arbitrary. Could you please tell me if you know of any clustering algorithms that can work for me?
If you replace "mean" with "most central point in cluster", then you get the k-medoids algorithm. Wikipedia claims that a metric is required, but I believe that to be incorrect, since I can't see where the majorization-minimization proof needs the triangle inequality or even symmetry.
There are various clustering algorithms that can work with arbitrary distance functions, in particular:
hierarchical clustering
k-medoids (PAM)
DBSCAN
OPTICS
many many more - get some good clustering book and/or software
But the only one which enforces k clusters and uses a "cluster representative" model is k-medoids. You may be putting too many constraints on the cluster model to get a wider choice.
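A minimal sketch of the k-medoids idea with a user-supplied distance function follows. This is the simple alternating (assign, then update) variant rather than the full PAM swap procedure, and it assumes `dist(p, p) == 0` with distinct points so that every medoid stays in its own cluster. The function names and the toy data are made up for illustration.

```python
import random

def assign(points, medoids, dist):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters

def k_medoids(points, k, dist, max_iter=100, seed=0):
    """Alternating k-medoids; `dist` can be any function of two points."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        clusters = assign(points, medoids, dist)
        # New medoid: the member with the smallest total distance from its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(points[j], points[c]) for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, dist)

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Hypothetical data: two well-separated groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(points, k=2, dist=manhattan)
groups = sorted(sorted(members) for members in clusters.values())
```

Because the representative is always one of the data points, only pairwise distances are ever evaluated, which is why no mean (and no vector-space structure) is needed.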
Since you want something that represents a centroid but is not one of the data points, a technique I once used was to perform something like k-medoids on N random samples. I then took all the members of each cluster and used them as samples to build a classifier that returned a class label. In the end, each class label returned from the classifier is an abstract notion of a set of clusters/centroids. I did this for a very specific and nuanced reason; I know the flaws.
If you don't want to have to specify K, and your vectors are not enormous and super sparse, then I would take a look at the Cobweb clustering in JavaML. JavaML also has a decent KMedoids.

Measuring distance between vectors

I have a set of 300,000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest vector. I have thought of the following methods:
Simple Euclidean distance
Cosine similarity
Use a kernel (for instance Gaussian) to calculate the Gram matrix.
Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.
I do not really understand when it is useful to do one rather than the other. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is the best?
Sorry for the weak question, but I had to start somewhere...
Thank you!
Your question is not quite clear. Are you looking for a distance metric between vectors, or an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as the kd-tree. (since you are just looking at points in d-dimensional space). See http://en.wikipedia.org/wiki/Nearest_neighbor_search, for other methods.
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
If your vectors are very sparse in nature and if they are binary, you can use Hamming or Hellinger distance. When your vectors have a large number of dimensions, avoid using Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits itself to pairs of probability distributions.
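For reference, the Hamming and Hellinger distances mentioned in this answer can be sketched in a few lines. The brute-force `nearest` helper is an assumption added for illustration (a linear scan, not an efficient index), and the Hellinger version assumes both inputs are non-negative and sum to 1.

```python
import math

def hamming_distance(a, b):
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions.

    Assumes p and q are non-negative and each sums to 1; the result lies in [0, 1].
    """
    total = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
    return math.sqrt(total / 2)

def nearest(query, vectors, dist):
    """Index of the vector closest to `query` (brute-force linear scan)."""
    return min(range(len(vectors)), key=lambda i: dist(query, vectors[i]))

# Hypothetical binary vectors.
candidates = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1]]
closest = nearest([1, 0, 0, 0], candidates, hamming_distance)
```

With 300,000 vectors a linear scan per query may still be acceptable; if it is not, the nearest-neighbour structures linked above are the place to look.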

Algorithm behind distance calculations in The Google Distance Matrix API

How does the Google Distance Matrix API calculate the distance from point A to B? Often there are multiple ways to go from A to B, and the question is how Google prioritizes different routes to find the one that is used for the distance calculation. Strategies could be:
Fastest
Shortest
Low risk of queues
Etc.
Sincerely,
Henning
Google Maps uses a fastest-route calculation, but the Distance Matrix API can also give an accurate distance in meters. Here is an answer from Nick Johnson, unfortunately not exactly about this question: What algorithms compute directions from point A to point B on a map? At least the algorithm is a modified one. I think the map is more flexible with the fastest-route calculation. I don't see why they couldn't switch between both.

What data do I need to implement k nearest neighbor?

I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.
It seems like K nearest neighbor or k means are the best way to do this.
I can't seem to understand how to actually implement this. I've seen some mathematical formulas (such as the one on the k-means Wikipedia page), but they don't really make sense to me.
Could someone maybe recommend some pseudo code, or places to look so I can get a better feel on how to do this?
K-Nearest Neighbor (aka KNN) is a classification algorithm.
Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this may be category of posts, who posted the item, who upvoted the item, etc.
Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown by locating the nearest neighbors to it in the classification system. If you determine the classification by the 3 nearest neighbors, it could then be called a 3-nearest-neighbor algorithm.
How you determine the 'nearest neighbor' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space where N represents the number of different classification characteristics you are examining.
A trivial example:
Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.
If I asked you which country a random longitude/latitude point is in, would you be able to figure it out? What would you do to figure it out?
Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.
The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.
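The city/country example above can be sketched as follows. This is only an illustration: the function name is made up, the coordinates are rough approximations, and plain Euclidean distance on (longitude, latitude) pairs is used because the answer treats them as points on an X,Y graph (a real system would use a geographic distance).

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Return (majority label, share of the k votes) for `query`.

    `training` is a list of ((x, y), label) pairs.
    """
    k = min(k, len(training))
    # Sort the labelled points by distance to the query and keep the k nearest.
    neighbours = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / k

# Approximate (longitude, latitude) of a few cities, for illustration only.
cities = [
    ((-87.6, 41.9), "USA"),     # Chicago
    ((-83.0, 42.3), "USA"),     # Detroit
    ((-74.0, 40.7), "USA"),     # New York
    ((-79.4, 43.7), "Canada"),  # Toronto
    ((-73.6, 45.5), "Canada"),  # Montreal
    ((-75.7, 45.4), "Canada"),  # Ottawa
]
country, confidence = knn_classify((-86.0, 41.0), cities, k=3)
```

The share of votes plays the role of the "degree of certainty" described above: a unanimous vote is more convincing than a narrow majority.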
What you described sounds like a recommender system engine, not a clustering algorithm like k-means, which in essence is an unsupervised approach. I don't have a clear idea of what reddit actually uses, but I found some interesting posts by googling "recommender + reddit", e.g. Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! Anyway, the k-NN algorithm (described in the top ten data mining algorithms, with pseudo-code on Wikipedia) might be used, or other techniques like collaborative filtering (used by Amazon, for example), described in this good tutorial.
k-means clustering in its simplest form averages values and keeps the other values grouped around a central average value. Suppose you have the following values:
1,2,3,4,6,7,8,9,10,11,12,21,22,33,40
Now suppose I do k-means clustering. Remember that k-means has a biasing (mean/averaging) mechanism that places each value either close to a center or far away from it. We get the following:
cluster-1
1,2,3,4,6,7,8,9
cluster-2
10,11,12
cluster-3
21,22
cluster-4
33
cluster-5
40
Remember I just made up these cluster centers (cluster 1-5).
So the next time you do clustering, the numbers will end up around one of these central means (also known as k-centers). The data above is one-dimensional.
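For the one-dimensional values above, the standard k-means loop (Lloyd's algorithm) can be sketched as below. The function name and the starting centers are assumptions chosen for this example; the clusters it finds depend on those starting centers and need not match the made-up grouping above.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Plain k-means on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # nothing moved, so the assignment is stable
        centers = new_centers
    return centers, clusters

values = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 21, 22, 33, 40]
# Hypothetical starting centers, one per desired cluster.
centers, clusters = kmeans_1d(values, centers=[1, 10, 21, 33, 40])
```

With these particular starting centers the loop settles on the means 2.5, 9.0, 21.5, 33.0 and 40.0; different starting points can give different clusters.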
When you perform k-means clustering on large, multidimensional data sets (multidimensional data is an array of values; you will have millions of arrays of the same dimension), you will need something bigger and more scalable. You first average one array to get a single value, repeat the same for the other arrays, and then perform the k-means clustering.
Read one of my questions Here
Hope this helps.
To do k-nearest neighbors you mostly need a notion of distance and a way of finding the k nearest neighbours to a point that you can afford (you probably don't want to search through all your data points one by one). There is a library for approximate nearest neighbour at http://www.cs.umd.edu/~mount/ANN/. It's a very simple classification algorithm - to classify a new point p, find its k nearest neighbours and classify p according to the most popular classes amongst those k neighbours.
I guess in your case you could provide somebody with a list of similar posts as soon as you decide what nearest means, and then monitor click-through from this and try to learn from that to predict which of those alternatives would be most popular.
If you are interested in finding a particularly good learning algorithm for your purposes, have a look at http://www.cs.waikato.ac.nz/ml/weka/ - it allows you to try out a large number of different algorithms, and also to write your own as plug-ins.
Here is a very simple example of KNN for the MNIST dataset.
Once you are able to calculate the distance between your documents, the same algorithm would work.
http://shyamalapriya.github.io/digit-recognition-using-k-nearest-neighbors/
