k-Nearest Neighbors where each point is its own class - algorithm

I'm trying to do recommendation based on feature similarity where my points in feature space represent unique classes. Essentially I have hundreds of unique items represented as low-dimensional feature vectors and I want to find the k-nearest neighbors in rank order for a new observation.
Conventionally you find the k neighbors and choose the class with the majority of representation therein. That won't work in my case considering each item has its own class.
Is kNN the wrong approach here? Is there a different family of algorithms more appropriate for this kind of problem?

Whether kNN is the right approach boils down to whether your classes are well characterized by a distance metric in your feature space. There is nothing inherently wrong with what you are proposing. You can simply associate a unique class with each training observation and then apply kNN with k = 1.
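To make that concrete, here is a minimal sketch using scikit-learn's NearestNeighbors (the item matrix, its dimensions, and the metric are made-up placeholders). Because every item is its own class, there is no voting step at all; the query simply returns the items in rank order of distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical catalog: a few hundred unique items as low-dimensional feature vectors.
rng = np.random.default_rng(0)
items = rng.normal(size=(500, 8))      # 500 items, 8 features each
item_ids = np.arange(500)              # each item is effectively its own class

# Fit the neighbor index once on the catalog.
nn = NearestNeighbors(n_neighbors=10, metric="euclidean").fit(items)

# For a new observation, return the 10 most similar items in rank order.
new_obs = rng.normal(size=(1, 8))
dist, idx = nn.kneighbors(new_obs)
for rank, (i, d) in enumerate(zip(idx[0], dist[0]), start=1):
    print(f"{rank}. item {item_ids[i]} (distance {d:.3f})")
```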

It sounds like you want to build a recommender system, where you recommend new products based on a product already purchased. This is not a classification problem, and so you shouldn't be treating it like one.
What method to use really depends on the details of your data: how much you have, how the features are represented, and other issues. Recommender systems are often a harder problem than simple classification, with more nuanced issues. This Coursera course may be more helpful to you.

Related

Machine learning method which is able to integrate prior knowledge in a decision tree

Do any of you know a machine learning method, or combination of methods, that makes it possible to integrate prior knowledge into the building process of a decision tree?
By "prior knowledge" I mean the information whether a feature in a particular node is really responsible for the resulting classification or not. Imagine we only have a short period of time in which our features are measured, and in this period the features happen to be correlated. If we measured the same features again, we probably would not see that correlation, because it was just a coincidence. Unfortunately it is not possible to measure again.
The problem that arises from this is: the feature chosen by the algorithm to perform a split is not the feature that actually drives the split in the real world. In other words, the strongly correlated feature is chosen by the algorithm while the other feature is the one that should be chosen. That's why I want to set rules / causalities / constraints for the tree learning process.
"a particular feature in an already learned tree" - the typical decision tree has one feature per node, and therefore each feature can appear in many different nodes. Similarly, each leaf has one classification, but each classification may appear in multiple leafs. (And with a binary classifier, any non-trivial tree must have repeated classifications).
This means that you can enumerate all leaves and sort them by classification to get uniform subsets of leaves. For each such subset, you can analyze all paths from the root of the tree to see which features occurred. But this will be a large set.
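As an illustration, here is a minimal sketch of that enumeration with a scikit-learn tree (the dataset and the tree depth are arbitrary placeholders):

```python
from collections import defaultdict
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = clf.tree_

def leaf_paths(node=0, path=()):
    """Yield (predicted_class, features_on_path) for every leaf."""
    if t.children_left[node] == -1:                  # no children -> leaf
        yield int(np.argmax(t.value[node])), path
    else:
        f = int(t.feature[node])
        yield from leaf_paths(t.children_left[node], path + (f,))
        yield from leaf_paths(t.children_right[node], path + (f,))

# Group leaves by their classification and list which features occur on their paths.
by_class = defaultdict(list)
for cls, feats in leaf_paths():
    by_class[cls].append(sorted(set(feats)))
print(dict(by_class))
```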
"But in my case there are some features which are strongly correlated ... The feature which is choosen by the algorithms to perform a split is not the feature which actually leads to the split in the real world."
It's been said that every model is wrong, but some models are useful. If the features are indeed strongly correlated, choosing this "wrong" feature doesn't really affect the model.
You can of course just modify the split algorithm in tree building. Trivially, "if the remaining classes are A and B, use split S, else determine the split using algorithm C4.5" is a valid splitting algorithm that hardcodes pre-existing knowledge about two specific classes without being restricted to just that case.
But note that it might just be easier to introduce a combined class A+B in the decision tree, and then decide between A and B in postprocessing.
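If you go the postprocessing route, a minimal sketch might look like the following (the data, the class names A/B/C, and the set of "trusted" features are all hypothetical placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): 3 classes A, B, C over 4 features;
# suppose only feature 0 is the one we "trust" to separate A from B.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.array(["A", "B", "C"])[rng.integers(0, 3, size=300)]
trusted = [0]  # features allowed to decide A vs. B

# Stage 1: grow the tree on a merged class "AB".
y_merged = np.where(np.isin(y, ["A", "B"]), "AB", y)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y_merged)

# Stage 2: a separate model, restricted to the trusted feature(s), splits A vs. B.
mask = np.isin(y, ["A", "B"])
ab_model = LogisticRegression().fit(X[mask][:, trusted], y[mask])

def predict(x):
    label = tree.predict(x.reshape(1, -1))[0]
    if label == "AB":
        label = ab_model.predict(x[trusted].reshape(1, -1))[0]
    return label

print(predict(X[0]))
```

The point is simply that the tree never has to tell A and B apart, so the questionable correlated feature never gets a chance to drive that split.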

Normalization of a multi-dimensional space, what algorithm is this?

I'm not a trained statistician so I apologize for the incorrect usage of some words. I'm just trying to get some good results from the Weka Nearest Neighbor algorithms. I'll use some redundancy in my explanation as a means to try to get the concept across:
Is there a way to normalize a multi-dimensional space so that the distances between any two instances are always proportional to the effect on the dependent variable?
In other words, I have a statistical data set and I want to use a "nearest neighbor" algorithm to find instances that are most similar to a specified test instance. Unfortunately my initial results are useless, because two attributes that are very close in value but only weakly correlated with the dependent variable will incorrectly bias the distance calculation.
For example, let's say you're trying to find the nearest neighbor of a given car based on a database of cars: make, model, year, color, engine size, number of doors. We know intuitively that the make, model, and year have a bigger effect on price than the number of doors. So a car that matches only on color and door count should not beat a car with a different color and door count but the same make/model/year. What algorithm(s) can be used to appropriately set the weights of each independent variable in the nearest-neighbor distance calculation so that the distance will be statistically proportional (correlated, whatever) to the dependent variable?
Application: This can be used for a more accurate "show me products similar to this other product" on shopping websites. Back to the car example, this would have cars of same make and model bubbling up to the top, with year used as a tie-breaker, and then within cars of the same year, it might sort the ones with the same number of cylinders (4 or 6) ahead of the ones with the same number of doors (2 or 4). I'm looking for an algorithmic way to derive something similar to the weights that I know intuitively (make >> model >> year >> engine >> doors) and actually assign numerical values to them to be used in the nearest-neighbor search for similar cars.
A more specific example:
Data set:
Blue,Honda,6-cylinder
Green,Toyota,4-cylinder
Blue,BMW,4-cylinder
now find cars similar to:
Blue,Honda,4-cylinder
in this limited example, it would match the Green,Toyota,4-cylinder ahead of the Blue,Honda,6-cylinder, because the two brands are statistically almost interchangeable and cylinder count is a stronger determinant of price than color. The BMW would match lower because that brand tends to double the price, i.e. placing the item at a larger distance.
Final note: the prices are available during training of the algorithm, but not during calculation.
Possibly you should look at Solr/Lucene for this. Solr provides similarity search based on field value frequency, and it already has MoreLikeThis functionality for finding similar items.
Maybe nearest neighbor is not a good algorithm for this case? Since you want to work with discrete values, it can become quite hard to define reasonable distances. I think a C4.5-like algorithm may better suit the application you describe. At each step the algorithm optimizes the information entropy, so you always select the feature that gives you the most information.
Found something on the IEEE website. The algorithm is called DKNDAW ("dynamic k-nearest-neighbor with distance and attribute weighted"). I couldn't locate the actual paper (it probably needs a paid subscription). This looks very promising, assuming that the attribute weights are computed by the algorithm itself.
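Whatever specific paper you end up with, the core idea of deriving attribute weights from the dependent variable can be sketched very simply. The example below uses synthetic numeric data (in practice the categorical attributes such as make and model would need to be encoded first); it weights each attribute by its mutual information with price and folds the weights into an ordinary nearest-neighbor search:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the car data: 5 already-encoded attributes, price known at training time.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                            # make, model, year, engine, doors
price = X @ np.array([5.0, 3.0, 2.0, 1.0, 0.1]) + rng.normal(scale=0.5, size=1000)

# Weight each attribute by how much information it carries about the price.
w = mutual_info_regression(X, price, random_state=0)
w = w / w.sum()

# Scaling features by sqrt(weight) turns plain Euclidean distance into a weighted distance.
Xw = X * np.sqrt(w)
nn = NearestNeighbors(n_neighbors=5).fit(Xw)

query = rng.normal(size=(1, 5)) * np.sqrt(w)
dist, idx = nn.kneighbors(query)
print("nearest items:", idx[0])
print("attribute weights:", w.round(3))
```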

Similarities Between Trees

I am working on a problem of clustering the results of keyword search on a graph. The results are in the form of trees, and I need to cluster those trees into groups based on their similarities. Every node of a tree has two keys: one is the table name in the SQL database (semantic form) and the second is the actual values of a record of that table (label).
I have used the Zhang and Shasha, Klein, Demaine and RTED algorithms to find the tree edit distance between the trees based on these two keys. All of these algorithms use the number of deletion/insertion/relabel operations needed to make the trees look the same.
I want some more metrics to check the similarities between two trees, e.g. number of nodes, average fan-out and so on, so that I can take a weighted average of these metrics to arrive at a good similarity measure which takes into account both the structure of the tree (semantic form) and the information contained in the tree (labels at the nodes).
Can you please suggest a way forward, or some good papers or other literature that could help?
Even if you had the (pseudo-)distances between each pair of possible trees, this is actually not what you're after. You actually want to do unsupervised learning (clustering) in which you combine structure learning with parameter learning. The types of data structures you want to perform inference on are trees. To postulate "some metric space" for your clustering method, you introduce something that is not really necessary. To find the proper distance measure is a very difficult problem. I'll point in different directions in the following paragraphs and hope they can help you on your way.
The following is not the only way to represent this problem... You can see your problem as Bayesian inference over all possible trees with all possible values at the tree nodes. You probably would have some prior knowledge on what kind of trees are more likely than others and/or what kind of values are more likely than others. The Bayesian approach would allow you to define priors for both.
One article you might like to read is "Learning with Mixtures of Trees" by Meila and Jordan, 2000 (pdf). It explains that it is possible to use a decomposable prior: the tree structure has a different prior from the values/parameters (this of course means that there is some assumption of independence at play here).
I know you were hinting at heuristics such as the average fan-out etc., but you might find it worthwhile to check out these new applications of Bayesian inference. Note, for example, that within nonparametric Bayesian methods it is also feasible to reason about infinite trees, as done e.g. by Hutter, 2004 (pdf)!
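That said, if you do want to start from the heuristic weighted combination described in the question, a minimal sketch could look like this (the nested-dict tree representation, the normalizations and the weights are all assumptions; the tree edit distance is assumed to be precomputed by one of the algorithms you already use):

```python
import numpy as np

def structural_stats(tree):
    """tree is a nested dict {"label": ..., "children": [...]} (hypothetical representation)."""
    nodes, fanouts = 0, []
    stack = [tree]
    while stack:
        n = stack.pop()
        nodes += 1
        kids = n.get("children", [])
        if kids:
            fanouts.append(len(kids))
        stack.extend(kids)
    return nodes, (np.mean(fanouts) if fanouts else 0.0)

def combined_distance(t1, t2, ted, weights=(0.6, 0.2, 0.2)):
    """Weighted mix of a precomputed tree edit distance and simple structural differences."""
    n1, f1 = structural_stats(t1)
    n2, f2 = structural_stats(t2)
    ted_norm = ted / max(n1, n2)                      # normalize TED by tree size
    node_diff = abs(n1 - n2) / max(n1, n2)
    fan_diff = abs(f1 - f2) / max(f1, f2, 1.0)
    return float(np.dot(weights, [ted_norm, node_diff, fan_diff]))

# Toy usage with two tiny trees and a made-up edit distance of 2.
a = {"label": "movies", "children": [{"label": "row1", "children": []},
                                     {"label": "row2", "children": []}]}
b = {"label": "movies", "children": [{"label": "row3", "children": []}]}
print(combined_distance(a, b, ted=2))
```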

Ways to determine a group of units in RTS

Looking for an algorithm that can be used to determine groups of units that move together as a squad in a real-time strategy game like StarCraft. The direction I am currently looking at is a clustering algorithm, but I'm having a hard time finding which one would work best since the units move as a group rather than just standing still. Any help would be great.
K-means is not the best choice, as it requires you to specify the number of clusters you expect to find. Some clusters might then contain only single objects.
I recommend adapting DBSCAN. In particular, the generalized version GDBSCAN.
For this, you need to define what constitutes the neighborhood of a unit - say, any other unit within a range of 2 that is belonging to the same player and moving approximately in the same direction (up to a certain delta threshold in x and y velocity).
Next, you need to specify when you consider units to start forming an initial cluster, called "core point". Say that is a minimum of 3 units.
Then using DBSCAN is quite basic, and should give you good results. You need to fine-tune the parameters a bit. Things like this minimum size are clearly an input parameter, and depend on your use case. So is the neighborhood definition: you are looking for groups that move into the same direction, this information needs to be put into the algorithm somehow. With GDBSCAN this is trivial, by adjusting the neighborhood definition.
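As a rough sketch of how that might look (the unit-state layout is made up; the distance threshold of 2, the velocity delta and the minimum squad size of 3 are the example values from above, and the precomputed-distance shortcut stands in for a proper GDBSCAN neighborhood predicate):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical unit state: columns = x, y, vx, vy, player_id
rng = np.random.default_rng(0)
squad_a = np.column_stack([rng.uniform(10, 13, 6), rng.uniform(10, 13, 6),
                           np.full(6, 1.0), np.full(6, 0.0), np.zeros(6)])
squad_b = np.column_stack([rng.uniform(30, 33, 5), rng.uniform(30, 33, 5),
                           np.full(5, -1.0), np.full(5, 0.0), np.zeros(5)])
stragglers = np.column_stack([rng.uniform(0, 50, 4), rng.uniform(0, 50, 4),
                              rng.normal(size=4), rng.normal(size=4), np.zeros(4)])
units = np.vstack([squad_a, squad_b, stragglers])

BIG = 1e6  # "not neighbors at all"

def unit_distance(a, b, max_vel_delta=0.5):
    if a[4] != b[4]:                                    # different players never group
        return BIG
    if np.abs(a[2:4] - b[2:4]).max() > max_vel_delta:   # moving in different directions
        return BIG
    return np.hypot(a[0] - b[0], a[1] - b[1])           # otherwise plain spatial distance

n = len(units)
D = np.array([[unit_distance(units[i], units[j]) for j in range(n)] for i in range(n)])

# eps = neighborhood range, min_samples = minimum units to form a core point (squad seed).
labels = DBSCAN(eps=2.0, min_samples=3, metric="precomputed").fit_predict(D)
print(labels)   # -1 = no squad (noise), other integers = squad ids
```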
You may want to look at a number of classification algorithms, like k-Nearest Neighbor or Support Vector Machines.
The k-means algorithm is a simple and standard approach; you can check whether it works for your case.
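A minimal sketch of that check with scikit-learn (the positions and the choice of k = 3 are made up; note that k-means requires the number of squads up front, which is exactly the limitation pointed out above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unit positions forming three loose groups.
rng = np.random.default_rng(0)
positions = np.vstack([rng.normal(center, 1.0, size=(10, 2))
                       for center in ([5, 5], [20, 8], [12, 25])])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(positions)
print(labels)
```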

Group detection in data sets

Assume a group of data points, such as one plotted here (this graph isn't specific to my problem, but just used as a suitable example):
Inspecting the scatter graph visually, it's fairly obvious the data points form two 'groups', with some random points that do not obviously belong to either.
I'm looking for an algorithm, that would allow me to:
start with a data set of two or more dimensions.
detect such groups from the dataset without prior knowledge on how many (or if any) might be there
once the groups have been detected, 'ask' the model of groups, if a new sample point seems to fit to any of the groups
There are many choices, but if you are interested in the probability that a new data point belongs to a particular mixture, I would use a probabilistic approach such as Gaussian mixture modeling either estimated by maximum likelihood or Bayes.
Maximum likelihood estimation of mixture models is implemented in Matlab.
Your requirement that the number of components is unknown makes your model more complex. The dominant probabilistic approach is to place a Dirichlet process prior on the mixture distribution and estimate it by some Bayesian method. For instance, see this paper on infinite Gaussian mixture models. The DP mixture model will give you inference over the number of components and the component each element belongs to, which is exactly what you want. Alternatively you could perform model selection on the number of components, but this is generally less elegant.
There are many implementations of DP mixture models, but they may not be as convenient. For instance, here's a Matlab implementation.
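If Python is an option, scikit-learn's BayesianGaussianMixture gives a convenient approximation of this idea. The sketch below (synthetic two-group data, arbitrary component cap) fits a Dirichlet-process-style mixture and then "asks" the model how well a new point fits the discovered groups:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical 2-D data with two visible groups plus scattered noise points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(100, 2)),
    rng.normal([4, 4], 0.5, size=(100, 2)),
    rng.uniform(-2, 6, size=(10, 2)),
])

# A Dirichlet-process-style prior: give it more components than you expect
# and let the weights of the unneeded ones shrink toward zero.
gmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

# "Ask" the model whether a new point fits any of the groups.
new_point = np.array([[0.2, -0.1]])
probs = gmm.predict_proba(new_point)[0]
print(probs.round(3))                 # posterior responsibility per component
print(gmm.weights_.round(3))          # effective mixture weights
```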
Your graph suggests you are an R user. In that case, if you are looking for prepacked solutions, the answer to your question lies on this Task View for cluster analysis.
I think you are looking for something along the lines of a k-means clustering algorithm.
You should be able to find adequate implementations in most general purpose languages.
You need one of the clustering algorithms. They can all be divided into two groups:
you specify the number of groups (clusters) - 2 clusters in your example
the algorithm tries to guess the correct number of clusters by itself
If you want an algorithm of the first type, then K-means is what you really need.
If you want an algorithm of the second type, then you probably need one of the hierarchical clustering algorithms. I haven't implemented any of them myself, but I can see an easy way to improve K-means so that it becomes unnecessary to specify the number of clusters.

Resources