I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.
It seems like K nearest neighbor or k means are the best way to do this.
I can't seem to understand how to actually implement this. I've seen some mathematical formulas (such as the one on the k means wikipedia page), but they don't really make sense to me.
Could someone maybe recommend some pseudo code, or places to look so I can get a better feel on how to do this?
K-Nearest Neighbor (aka KNN) is a classification algorithm.
Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this may be category of posts, who posted the item, who upvoted the item, etc.
Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown by locating the nearest neighbors to it in the classification system. If you determine the classification by the 3 nearest neighbors, it could then be called a 3-nearest neighboring algorithm.
How you determine the 'nearest neighbor' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space where N represents the number of different classification characteristics you are examining.
A trivial example:
Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.
If I asked you which country the a random longitude latitude point is in, would you be able to figure it out? What would you do to figure it out?
Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.
The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.
What you described sounds like a recommender system engine, not a clustering algorithm like k-means which in essence is an unsupervised approach. I cannot make myself a clear idea of what reddit uses actually, but I found some interesting post by googling around "recommender + reddit", e.g. Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! Anyway, the k-NN algorithm (described in the top ten data mining algorithm, with pseudo-code on Wikipedia) might be used, or other techniques like Collaborative filtering (used by Amazon, for example), described in this good tutorial.
k-Means clustering in its simplest form is averaging values and keep other average values around one central average value. Suppose you have the following values
1,2,3,4,6,7,8,9,10,11,12,21,22,33,40
Now if I do k-means clustering and remember that the k-means clustering will have a biasing (means/averaging) mechanism that shall either put values close to the center or far away from it. And we get the following.
cluster-1
1,2,3,4,5,6,7,8
cluster-2
10,11,12
cluster-3
21,22
cluster-4
33
cluster-5
40
Remember I just made up these cluster centers (cluster 1-5).
So the next, time you do clustering, the numbers would end up around any of these central means (also known as k-centers). The data above is single dimensional.
When you perform kmeans clustering on large data sets, with multi dimension (A multidimensional data is an array of values, you will have millions of them of the same dimension), you will need something bigger and scalable. You will first average one array, you will get a single value, like wise you will repeat the same for other arrays, and then perform the kmean clustering.
Read one of my questions Here
Hope this helps.
To do k-nearest neighbors you mostly need a notion of distance and a way of finding the k nearest neighbours to a point that you can afford (you probably don't want to search through all your data points one by one). There is a library for approximate nearest neighbour at http://www.cs.umd.edu/~mount/ANN/. It's a very simple classification algorithm - to classify a new point p, find its k nearest neighbours and classify p according to the most popular classes amongst those k neighbours.
I guess in your case you could provide somebody with a list of similar posts as soon as you decide what nearest means, and then monitor click-through from this and try to learn from that to predict which of those alternatives would be most popular.
If you are interested in finding a particularly good learning algorithm for your purposes, have a look at http://www.cs.waikato.ac.nz/ml/weka/ - it allows you to try out a large number of different algorithms, and also to write your own as plug-ins.
Here is a very simple example of KNN for the MINST dataset
Once you are able to calculate distance between your documents, the same algorithm would work
http://shyamalapriya.github.io/digit-recognition-using-k-nearest-neighbors/
Related
I am developing an mobile application for recording user trips. A trip is made by sequences of user positions (with longitude and latitude values).
Now my problem is how to determine a trip has been traveled so far? On the other words, how to determine duplicated paths between trips?
(I know that we could not have 2 trips with the exact same data points, hence, I don't know how to begin with, I am looking for an algorithm could approximately address this problem).
Thank for your help!
There are a couple of trajectory distance measures that could help: Euclidean Distance, Dynamic Time Wraping, Edit Distance with Real Penalty, LCSS, ... Which one to pick depends on how you want to define similarity.
In this paper the authors describe all distance measures and evaluate them.
As far as I understand your scenario an LCSS or ERP based similarity measure might fit. A quick search brought me to this Github Repository
I am calculating pathfinding inside a mesh which I have build a uniform grid around. The nodes (cells in the 3D grid) close to what I deem a "standable" surface I mark as accessible and they are used in my pathfinding. To get alot of detail (like being able to pathfind up small stair cases) the ammount of accessible cells in my grid have grown quite large, several thousand in larger buildings. (every grid cell is 0.5x0.5x0.5 m and the meshes are rooms with real world dimensions). Even though I only use a fraction of the actual cells in my grid for pathfinding the huge ammount slows the algorithm down. Other than that it works fine and finds the correct path through the mesh, using a weighted manhattan distance heuristic.
Imagine my grid looks like that and the mesh is inside it (can be more or less cubes but its always cubical), however the pathfinding will not be calculated on all the small cubes just a few marked as accessible (usually at the bottom of the grid but that can depend on how many floors the mesh has).
I am looking to reduce the search space for the pathfinding... I have looked at clustering like how HPA* does it and other clustering algorithms like Markov but they all seem to be best used with node graphs and not grids. One obvious solution would be to just increase the size of the small cubes building the grid but then I would lose alot of detail in the pathfinding and it would not be as robust. How could I cluster these small cubes? This is how a typical search space looks when I do my pathfinding (blue are accessible, green is path):
and as you see there is a lot of cubes to search through because the distance between them is quite small!
Never mind that the grid is an unoptimal solution for pathfinding for now.
Does anyone have an idea on how to reduce the ammount of cubes in the grid I have to search through and how would I access the neighbors after I reduce the space? :) Right now it only looks at the closest neighbors while expanding the search space.
A couple possibilities come to mind.
Higher-level Pathfinding
The first is that your A* search may be searching the entire problem space. For example, you live in Austin, Texas, and want to get into a particular building somewhere in Alberta, Canada. A simple A* algorithm would search a lot of Mexico and the USA before finally searching Canada for the building.
Consider creating a second layer of A* to solve this problem. You'd first find out which states to travel between to get to Canada, then which provinces to reach Alberta, then Calgary, and then the Calgary Zoo, for example. In a sense, you start with an overview, then fill it in with more detailed paths.
If you have enormous levels, such as skyrim's, you may need to add pathfinding layers between towns (multiple buildings), regions (multiple towns), and even countries (multiple regions). If you were making a GPS system, you might even need continents. If we'd become interstellar, our spaceships might contain pathfinding layers for planets, sectors, and even galaxies.
By using layers, you help to narrow down your search area significantly, especially if different areas don't use the same co-ordinate system! (It's fairly hard to estimate distance for one A* pathfinder if one of the regions needs latitude-longitude, another 3d-cartesian, and the next requires pathfinding through a time dimension.)
More efficient algorithms
Finding efficient algorithms becomes more important in 3 dimensions because there are more nodes to expand while searching. A Dijkstra search which expands x^2 nodes would search x^3, with x being the distance between the start and goal. A 4D game would require yet more efficiency in pathfinding.
One of the benefits of grid-based pathfinding is that you can exploit topographical properties like path symmetry. If two paths consist of the same movements in a different order, you don't need to find both of them. This is where a very efficient algorithm called Jump Point Search comes into play.
Here is a side-by-side comparison of A* (left) and JPS (right). Expanded/searched nodes are shown in red with walls in black:
Notice that they both find the same path, but JPS easily searched less than a tenth of what A* did.
As of now, I haven't seen an official 3-dimensional implementation, but I've helped another user generalize the algorithm to multiple dimensions.
Simplified Meshes (Graphs)
Another way to get rid of nodes during the search is to remove them before the search. For example, do you really need nodes in wide-open areas where you can trust a much more stupid AI to find its way? If you are building levels that don't change, create a script that parses them into the simplest grid which only contains important nodes.
This is actually called 'offline pathfinding'; basically finding ways to calculate paths before you need to find them. If your level will remain the same, running the script for a few minutes each time you update the level will easily cut 90% of the time you pathfind. After all, you've done most of the work before it became urgent. It's like trying to find your way around a new city compared to one you grew up in; knowing the landmarks means you don't really need a map.
Similar approaches to the 'symmetry-breaking' that Jump Point Search uses were introduced by Daniel Harabor, the creator of the algorithm. They are mentioned in one of his lectures, and allow you to preprocess the level to store only jump-points in your pathfinding mesh.
Clever Heuristics
Many academic papers state that A*'s cost function is f(x) = g(x) + h(x), which doesn't make it obvious that you may use other functions, multiply the weight of the cost functions, and even implement heatmaps of territory or recent deaths as functions. These may create sub-optimal paths, but they greatly improve the intelligence of your search. Who cares about the shortest path when your opponent has a choke point on it and has been easily dispatching anybody travelling through it? Better to be certain the AI can reach the goal safely than to let it be stupid.
For example, you may want to prevent the algorithm from letting enemies access secret areas so that they avoid revealing them to the player, and so that they AI seems to be unaware of them. All you need to achieve this is a uniform cost function for any point within those 'off-limits' regions. In a game like this, enemies would simply give up on hunting the player after the path grew too costly. Another cool option is to 'scent' regions the player has been recently (by temporarily increasing the cost of unvisited locations because many algorithms dislike negative costs).
If you know what places you won't need to search, but can't implement in your algorithm's logic, a simple increase to their cost will prevent unnecessary searching. There's a lot of ways to take advantage of heuristics to simplify and inform your pathfinding, but your biggest gains will come from Jump Point Search.
EDIT: Jump Point Search implicitly selects pathfinding direction using the same heuristics as A*, so you may be able to implement heuristics to a small degree, but their cost function won't be the cost of a node, but rather, the cost of traveling between the two nodes. (A* generally searches adjacent nodes, so the distinction between a node's cost and the cost of traveling to it tends to break down.)
Summary
Although octrees/quad-trees/b-trees can be useful in collision-detection, they aren't as applicable to searches because they section a graph based on its coordinates; not on its connections. Layering your graph (mesh in your vocabulary) into super graphs (regions) is a more effective solution.
Hopefully I've covered anything you'll find useful.
Good luck!
I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: Given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with eachother. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I believe you are a developer and you have no experience with data science?
There are a number of clustering algorithms and they have their advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think solution for your problem is easier than one can think.
I assume that N is small enough so you can store a matrix with N^2 float values in RAM memory? If this is the case, you are in a very comfortable situation. You write that you know how to implement similarity measure, so just calculate the measure for all N^2 pairs and store it in a matrix (it is a symmetric matrix, so only half of it can be stored). Please ensure that your similarity measure assigns special value for pair of images, where similarity measure is less than some X%, like 0 or infinity (it depends on that you treat a function like similarity measure or like a distance). I think perfect solution is to assign 1 for pairs, where similarity is greater than X% threshold and 0 otherwise.
After that, treat is just like a graph. Get first vertex and make, e.g., deep first search or any other graph walking routine. This is your first cluster. After that get first not visited vertex and repeat graph walking. Of course you can store graph as an adjacency list to save memory.
This algorithm assumes that you really do not pay attention to that how much images are similar and which pairs are more similar than other, but if they are similar enough (similarity measure is greater than a given threshold).
Unfortunately in cluster analysis it is common that 100% of possible pairs has to be computed. It is possible to save some number of distance calls using some fancy data structures for k-nearest neighbor search, but you have to assure that your similarity measure hold triangle inequality.
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify number of clusters)
Hierarchical clustering (slow computation time, at the top all images are in one cluster, you have to cut a dendrogram at proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram top to bottom until I find a cluster where the distance is below a threshold. Worst case there is no such cluster but then I'll end up in a leaf of the dendrogram which means that element is in a cluster of its own.
Suppose there is a point cloud having 50 000 points in the x-y-z 3D space. For every point in this cloud, what algorithms or data strictures should be implemented to find k neighbours of a given point which are within a distance of [R,r]? Naive way is to go through each of the 49 999 points for each of the 50 000 points and do a metric testing. But this approach will take large time. Just like there is kd tree to find nearest neighbour in small time so is there some real-time DS/algo implementation out there to pre-process the point clouds to achieve the goal inn shortest time?
Your problem is part of the topic of Nearest Neighbor Search, or more precisely, k-Nearest Neighbor Search. The answer to your question depends on the data structure you are using to store the points. If you use R-trees or variants like R*-trees, and you are doing multiple searches on your database, you will likely find a substantial performance improvement in two or three-dimensional space compared with naive linear search. In higher dimensions, space partitioning schemes tend to underperform linear search.
As some answers already suggest for NN search you could use some tree algorithm like k-d-tree. There are implementations available for all programming languages.
If your description [R,r] suggests a hollow sphere you should compare one-time-testing (within interval) vs. two stages (test-for-outer and remove samples that pass test-for-inner).
You also did not mention performance requirements (timing or frame rate?) and your intended application (feasible approach?).
If you are using an ordinary Euclidean metric, you could go through the list three times and extract those points that within R in each dimension, essentially extracting the enclosing cube. Searching the resulting list would still be O(n^2), but on a much smaller n.
There are efficient algorithms (in average, for random data), see Nearest neighbor search.
Your approach is not efficient, yet simple.
Please read through, check you requirements and get back so we can help.
Here is the problem:
I have many sets of points, and want to come up with a function that can take one set and rank matches based on their similarity to the first. Scaling, translation, and rotation do not matter, and some points may be missing from any of the sets of points. The best match is the one that if scaled and translated in the ideal way has the least mean square error between points (maybe with a cap on penalty, or considering only the best fraction of points to handle missing points).
I am trying to come up with a good way to do this, and am wondering if there are any well known algorithms that can handle this type of problem? Just the name of something would be awesome! I lack a formal CSCI or math education, and am doing the best to teach myself.
A few things I have tried
The first thing that comes to mind is to normalize the points somehow, but I dont think that this is helpful because the missing points may throw things off.
The best way I can think of is to estimate a starting point by translating to match their centroids, scaling so that the largest distances from the centroid of the sets match. From there, do an A* search, scaling, rotating, and translating until I reach a maximum, and then compare the two sets. (I hope I am using the term A* correctly, I mean trying small translations and scalings and selecting the move giving the best match) I think this will find the global maximum most of the time, but is not guaranteed to. I am looking for a better way that will always be correct.
Thanks a ton for the help! It has been fun and interesting trying to figure this out so far, so I hope it is for you as well.
There's a very clever algorithm for identifying starfields. You find 4 points in a diamond shape and then using the two stars farthest apart you define a coordinate system locating the other two stars. This is scale and rotation invariant because the locations are relative to the first two stars. This forms a hash. You generate several of these hashes and use those to generate candidates. Once you have the candidates you look for ones where multiple hashes have the correct relationships.
This is described in a paper and a presentation on http://astrometry.net/ .
This paper may be useful: Shape Matching and Object Recognition Using Shape Contexts
Edit:
There is a couple of relatively simple methods to solve the problem:
To combine all possible pairs of points (one for each set) to nodes, connect these nodes where distances in both sets match, then solve the maximal clique problem for this graph. Since the maximal clique problem is NP-complete, the complexity is probably O(exp(n^2)), so if you have too many points, don't use this algorithm directly, use some approximation.
Use Generalised Hough transform to match two sets of points. This approach has less complexity (O(n^4)). But it is more complicated, so I cannot explain it here.
You can find the details in computer vision books, for example "Machine vision: theory, algorithms, practicalities" by E. R. Davies (2005).