Similarities Between Trees - algorithm

I am working on a problem of Clustering of Results of Keyword Search on Graph. The results are in the form of Tree and I need to cluster those threes in group based on their similarities. Every node of the tree has two keys, one is the table name in the SQL database(semantic form) and second is the actual values of a record of that table(label).
I have used Zhang and Shasha, Klein, Demaine and RTED algorithms to find the Tree Edit Distance between the trees based on these two keys. All algorithms use no of deletion/insertion/relabel operation need to modify the trees to make them look same.
**I want some more matrices of to check the similarities between two trees e.g. Number of Nodes, average fan outs and more so that I can take a weighted average of these matrices to reach on a very good similarity matrix which takes into account both the semantic form of the tree (structure) and information contained in the tree(Labels at the node).
Can you please suggest me some way out or some literature which can be of some help?**
Can anyone suggest me some good paper

Even if you had the (pseudo-)distances between each pair of possible trees, this is actually not what you're after. You actually want to do unsupervised learning (clustering) in which you combine structure learning with parameter learning. The types of data structures you want to perform inference on are trees. To postulate "some metric space" for your clustering method, you introduce something that is not really necessary. To find the proper distance measure is a very difficult problem. I'll point in different directions in the following paragraphs and hope they can help you on your way.
The following is not the only way to represent this problem... You can see your problem as Bayesian inference over all possible trees with all possible values at the tree nodes. You probably would have some prior knowledge on what kind of trees are more likely than others and/or what kind of values are more likely than others. The Bayesian approach would allow you to define priors for both.
One article you might like to read is "Learning with Mixtures of Trees" by Meila and Jordan, 2000 (pdf). It explains that it is possible to use a decomposable prior: the tree structure has a different prior from the values/parameters (this of course means that there is some assumption of independence at play here).
I know you were hinting at heuristics such as the average fan-out etc., but you might find it worthwhile to check out these new applications of Bayesian inference. Note, for example that within nonparametric Bayesian method it is also feasible to reason about infinite trees, as done e.g. by Hutter, 2004 (pdf)!

Related

Machine learning method which is able to integrate prior knowledge in a decision tree

Does any of you know a machine learning method or combination of methods which makes it possible to integrate prior knowledge in the building process of a decision tree?
With "prior knowledge" I mean the information if a feature in a particular node is really responsible for the resulting classification or not. Imagine we only have a short period of time where our features are measured and in this period of time we have a correlation between features. If we now would measure the same features again, we probably would not get a correlation between those features, because it was just a coincidence that they are correlated. Unfortunately it is not possible to measure again.
The problem which arises with that is: the feature which is chosen by the algorithms to perform a split is not the feature which actually leads to the split in the real world. In other words the strongly correlated feature is chosen by the algorithm while the other feature is the one which should be chosen. That's why I want to set rules / causalities / constraints for the tree learning process.
"a particular feature in an already learned tree" - the typical decision tree has one feature per node, and therefore each feature can appear in many different nodes. Similarly, each leaf has one classification, but each classification may appear in multiple leafs. (And with a binary classifier, any non-trivial tree must have repeated classifications).
This means that you can enumerate all leafs and sort them by classification to get uniform subsets of leaves. For each such subset, you can analyze all paths from the root of the tree to see which features occurred. But this will be a large set.
"But in my case there are some features which are strongly correlated ... The feature which is choosen by the algorithms to perform a split is not the feature which actually leads to the split in the real world."
It's been said that every model is wrong, but some models are useful. If the features are indeed strongly correlated, choosing this "wrong" feature doesn't really affect the model.
You can of course just modify the split algorithm in tree building. Trivially, "if the remaining classes are A and B, use split S, else determine the split using algorithm C4.5" is a valid splitting algorithm that hardcodes pre-existing knowledge about two specific classes without being restricted to just that case.
But note that it might just be easier to introduce a combined class A+B in the decision tree, and then decide between A and B in postprocessing.

k-Nearest Neighbors where each point is its own class

I'm trying to do recommendation based on feature similarity where my points in feature space represent unique classes. Essentially I have hundreds of unique items represented as low-dimensional feature vectors and I want to find the k-nearest neighbors in rank order for a new observation.
Conventionally you find the k neighbors and choose the class with the majority of representation therein. That won't work in my case considering each item has its own class.
Is kNN the wrong approach here? Is there a different family of algorithms more appropriate for this kind of problem?
Whether kNN is the right approach boils down to whether your classes are well characterized by a distance metric in your feature space. There is nothing inherently wrong with what you are proposing. You can simply associate a unique class with each training observation and then apply kNN with k = 1.
It sounds like you want to build a recommender system, where you recommend new products based on a product already purchased. This is not a classification problem, and so you shouldn't be treating it like one.
What method to use really depends on more details about your data, the amount, the feature representation, and other issues. Recommender systems are often a harder problem them simple classification with more nuanced issues. This coursera course may be more helpful to you.

create decision tree from data

I'm trying to create decision tree from data. I'm using the tree for guess-the-animal-game kind of application. User answers questions with yes/no and program guesses the answer. This program is for homework.
I don't know how to create decision tree from data. I have no way of knowing what will be the root node. Data will be different every time. I can't do it by hand. My data is like this:
Animal1: property1, property3, property5
Animal2: property2, property3, property5, property6
Animal3: property1, property6
etc.
I searched stackoverflow and i found ID3 and C4.5 algorithms. But i don't know if i should use them.
Can someone direct me, what algorithm should i use, to build decision tree in this situation?
I searched stackoverflow and i found ID3 and C4.5 algorithms. But i
don't know if i should use them.
Yes, you should. They are very commonly used decision trees, and have some nice open source implementations for them. (Weka's J48 is an example implementation of C4.5)
If you need to implement something from scratch, implementing a simple decision tree is fairly simple, and is done iteratively:
Let the set of labled samples be S, with set of properties P={p1,p2,...,pk}
Choose a property pi
Split S to two sets S1,S2 - S1 holds pi, and S2 do not. Create two children for the current node, and move S1 and S2 to them respectively
Repeat for S'=S1, S'=S2 for each of the subsets of samples, if they are not empty.
Some pointers:
At each iteration you basically split the current data to 2 subsets, the samples that hold pi, and the data that does not. You then create two new nodes, which are the current node's children, and repeat the process for each of them, each with the relevant subset of data.
A smart algorithm chooses the property pi (in step 2) in a way that minimizes the tree's height as much as it can (finding the best solution is NP-Hard, but there are greedy approaches to minimize entropy, for example).
After the tree is created, some pruning to it is done, in order to avoid overfitting.
A simple extension of this algorithm is using multiple decision trees that work seperately - this is called Random Forests, and is empirically getting pretty good results usually.

Performance of an A* search implemented in Clojure

I have implemented an A* search algorithm for finding a shortest path between two states.
Algorithm uses a hash-map for storing best known distances for visited states. And one hash-map for storing child-parent relationships needed for reconstruction of the shortest path.
Here is the code. Implementation of the algorithm is generic (states only need to be "hashable" and "comparable") but in this particular case states are pairs (vectors) of ints [x y] and they represent one cell in a given heightmap (cost for jumping to neighboring cell depends on the difference in heights).
Question is whether it's possible to improve performance and how? Maybe by using some features from 1.2 or future versions, by changing logic of the algorithm implementation (e.g. using different way to store path) or changing state representation in this particular case?
Java implementation runs in an instant for this map and Clojure implementation takes about 40 seconds. Of course, there are some natural and obvious reasons for this: dynamic typing, persistent data structures, unnecessary (un)boxing of primitive types...
Using transients didn't make much difference.
Using priority-map instead of sorted-set
I first used sorted-set for storing open nodes (search frontier), switching to priority-map improved performance: now it takes 15-20 seconds for this map (before it took 40s).
This blog post was very helpful. And "my" new implementation is pretty much the same.
New a*-search can be found here.
I don't know Clojure, but I can give you some general advice about improving the performance of Vanilla A*.
Consider implementing IDA*, which is a variant of A* that uses less memory, if it's suitable for your domain.
Try a different heuristic. A good heuristic can have a significant impact on the number of node expansions required.
Use a cache, Often called a "transposition table" in search algorithms. Since search graphs are usually Directed Acyclic Graphs and not true trees, you can end up repeating the search of a state more than once; a cache to remember previously-searched nodes reduces node expansions.
Dr. Jonathan Schaeffer has some slides on this subject:
http://webdocs.cs.ualberta.ca/~jonathan/Courses/657/Notes/10.Single-agentSearch.pdf
http://webdocs.cs.ualberta.ca/~jonathan/Courses/657/Notes/11.Evaluations.pdf

Group detection in data sets

Assume a group of data points, such as one plotted here (this graph isn't specific to my problem, but just used as a suitable example):
Inspecting the scatter graph visually, it's fairly obvious the data points form two 'groups', with some random points that do not obviously belong to either.
I'm looking for an algorithm, that would allow me to:
start with a data set of two or more dimensions.
detect such groups from the dataset without prior knowledge on how many (or if any) might be there
once the groups have been detected, 'ask' the model of groups, if a new sample point seems to fit to any of the groups
There are many choices, but if you are interested in the probability that a new data point belongs to a particular mixture, I would use a probabilistic approach such as Gaussian mixture modeling either estimated by maximum likelihood or Bayes.
Maximum likelihood estimation of mixtures models is implemented in Matlab.
Your requirement that the number of components is unknown makes your model more complex. The dominant probabilistic approach is to place a Dirichlet Process prior on the mixture distribution and estimate by some Bayesian method. For instance, see this paper on infinite Gaussian mixture models. The DP mixture model will give you inference over the number of components and the components each elements belong to, which is exactly what you want. Alternatively you could perform model selection on the number of components, but this is generally less elegant.
There are many implementation of DP mixture models models, but they may not be as convenient. For instance, here's a Matlab implementation.
Your graph suggests you are an R user. In that case, if you are looking for prepacked solutions, the answer to your question lies on this Task View for cluster analysis.
I think you are looking for something along the lines of a k-means clustering algorithm.
You should be able to find adequate implementations in most general purpose languages.
You need one of clustering algorithms. All of them can be devided in 2 groups:
you specify number of groups (clusters) - 2 clusters in your example
algorithm try to guess correct number of clusters by itself
If you want algorithm of 1st type then K-Means is what you really need.
If you want algorithm of 2nd type then you probably need one of hierarchical clustering algorithms. I haven't ever implement any of them. But I see an easy way to improve K-means in such way thay it will be unnecessary to specify number of clusters.

Resources