How do I calculate tree edit distance? [closed]

I need to calculate the edit distance between trees. This paper describes an algorithm, but I can't make heads or tails out of it. Could you describe an applicable algorithm in a more approachable way? Pseudocode or code would both be helpful.

This Python library implements the Zhang-Shasha algorithm correctly: Zhang-Shasha: Tree edit distance in Python
It began as a direct port of the Java source listed in the currently accepted answer (the one with the tarball link), but that implementation is incorrect and nearly impossible to run.
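For reference, here is a minimal usage sketch with that library (the zss package, pip install zss). The example trees and the expected distance are my own, not from the library's docs:
from zss import simple_distance, Node

# Build two small labeled, ordered trees.
A = (Node("a")
     .addkid(Node("b")
             .addkid(Node("c"))
             .addkid(Node("d")))
     .addkid(Node("e")))
B = (Node("a")
     .addkid(Node("b")
             .addkid(Node("c")))
     .addkid(Node("x")))

# Unit-cost edit distance: delete "d", relabel "e" -> "x".
print(simple_distance(A, B))  # 2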

I wrote an implementation (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram) for those of you who wish to use the PQ-Gram tree edit distance approximation in the browser and/or in Node.js.
The jqgram module implements the PQ-Gram approximation algorithm for both server-side and browser-side applications; it runs in O(n log n) time and O(n) space, where n is the number of nodes. See the academic paper the algorithm comes from: http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf
The PQ-Gram approximation is much faster than obtaining the true edit distance via Zhang & Shasha, Klein, or Guha et al., whose exact algorithms all take at least O(n^2) time and are therefore often unsuitable.
Often in real-world applications it is not necessary to know the true edit distance, as long as a relative approximation of multiple trees to a known standard can be obtained. JavaScript, in the browser and now on the server with Node.js, deals frequently with tree structures, and end-user performance is usually critical in algorithm implementation and design; thus jqgram.
Example:
var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
          "thekids": [
              { "thelabel": "c" },
              { "thelabel": "d" }
          ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
};
var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
          "kiddos": [
              { "name": "c" },
              { "name": "d" },
              { "name": "y" }
          ]},
        { "name": "e" },
        { "name": "x" }
    ]
};
jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
}, {
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
}, { p: 2, q: 3, depth: 10 },
function(result) {
    console.log(result.distance);
});
Note that the lfn and cfn parameters specify, for each tree independently, how to determine the node labels and the children array, so you can do funky things like comparing a plain object to a browser DOM, for example. All you need to do is provide those functions along with each root, and jqgram will do the rest, calling your lfn and cfn functions to build out the trees. So in that sense it is (in my opinion, anyway) much easier to use than PyGram. Plus, it's JavaScript, so you can use it client- or server-side!
One approach you can take is to use jqgram or PyGram to find a few trees that are close, and then apply a true edit distance algorithm to that smaller set. Why spend all the computation on trees you can already easily determine are very distant (or, conversely, very close)? So you can use jqgram to narrow down choices too.
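To make that two-stage idea concrete, here is a hypothetical Python sketch; approx_distance and exact_distance are stand-ins for a real PQ-Gram implementation and a true edit distance such as Zhang-Shasha:
def approx_distance(a, b):
    # Placeholder for a PQ-Gram distance: cheap but inexact.
    return abs(len(str(a)) - len(str(b)))

def exact_distance(a, b):
    # Placeholder for a true tree edit distance (e.g. Zhang-Shasha).
    return 0

def closest_tree(query, candidates, shortlist_size=5):
    # Rank cheaply first, then spend the expensive O(n^2) computation
    # only on the shortlist of survivors.
    ranked = sorted(candidates, key=lambda t: approx_distance(query, t))
    shortlist = ranked[:shortlist_size]
    return min(shortlist, key=lambda t: exact_distance(query, t))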

Here's some Java source code (gzipped tarball at the bottom) for a tree edit distance algorithm that might be useful to you.
The page includes references and some slides that go through the "Zhang and Shasha" algorithm step-by-step and other useful links to get you up to speed.
The code in the link has bugs. Steve Johnson and tim.tadh have provided working Python code. See Steve Johnson's comment for more details.

Here you find Java implementations of tree edit distance algorithms:
Tree Edit Distance
In addition to Zhang & Shasha's 1989 algorithm, there are also implementations of more recent tree edit distance algorithms, including Klein 1998, Demaine et al. 2009, and the Robust Tree Edit Distance (RTED) algorithm by Pawlik & Augsten, 2011.

I made a simple Python wrapper (apted.py) for the APTED algorithm using jpype:
# To use, create a folder named lib next to apted.py, then put APTED.jar into it
import os
import os.path
import jpype

distancePackage = None
utilPackage = None

def StartJVM():
    # from http://www.gossamer-threads.com/lists/python/python/379020
    root = os.path.abspath(os.path.dirname(__file__))
    jpype.startJVM(jpype.getDefaultJVMPath(),
                   "-Djava.ext.dirs=%s%slib" % (root, os.sep))
    global distancePackage
    distancePackage = jpype.JPackage("distance")
    global utilPackage
    utilPackage = jpype.JPackage("util")

def StopJVM():
    jpype.shutdownJVM()

class APTED:
    def __init__(self, delCost, insCost, matchCost):
        if distancePackage is None:
            raise Exception("Need to call apted.StartJVM() first")
        self.myApted = distancePackage.APTED(float(delCost), float(insCost), float(matchCost))

    def nonNormalizedTreeDist(self, lblTreeA, lblTreeB):
        return self.myApted.nonNormalizedTreeDist(lblTreeA.myLblTree, lblTreeB.myLblTree)

class LblTree:
    def __init__(self, treeString):
        if utilPackage is None:
            raise Exception("Need to call apted.StartJVM() first")
        self.myLblTree = utilPackage.LblTree.fromString(treeString)

'''
# Example usage:
import apted
apted.StartJVM()
aptedDist = apted.APTED(delCost=1, insCost=1, matchCost=1)
treeA = apted.LblTree('{a}')
treeB = apted.LblTree('{b{c}}')
dist = aptedDist.nonNormalizedTreeDist(treeA, treeB)
print(dist)
# When you are done using apted
apted.StopJVM()
# For some reason it doesn't usually let me start it again, and it
# crashes the Python interpreter upon exit when I do, so call it
# only as needed.
'''

There are many variations of tree edit distance. If you can go with top-down tree edit distance, which restricts insertions and deletions to the leaves, I suggest trying the following paper: Comparing Hierarchical Data in External Memory.
The implementation is a straightforward dynamic programming matrix with O(n^2) cost.
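The paper's algorithm is designed for external memory and differs in its details, but as an illustration of the underlying dynamic program, here is a minimal in-memory Python sketch of a top-down tree edit distance in the spirit of Selkow's algorithm, assuming trees are (label, children) tuples:
def subtree_size(t):
    label, children = t
    return 1 + sum(subtree_size(c) for c in children)

def top_down_distance(a, b):
    (la, ca), (lb, cb) = a, b
    cost = 0 if la == lb else 1                       # relabel the roots?
    m, n = len(ca), len(cb)
    size_a = [subtree_size(c) for c in ca]            # deleting a child removes its whole subtree
    size_b = [subtree_size(c) for c in cb]
    # dp[i][j]: distance between the first i children of a and the first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size_a[i - 1]
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size_b[j - 1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + size_a[i - 1],                                # delete
                           dp[i][j - 1] + size_b[j - 1],                                # insert
                           dp[i - 1][j - 1] + top_down_distance(ca[i - 1], cb[j - 1]))  # match
    return cost + dp[m][n]

t1 = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
t2 = ("a", [("b", [("c", [])]), ("x", [])])
print(top_down_distance(t1, t2))  # 2: delete "d", relabel "e" -> "x"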

There is a journal version of the ICALP 2007 paper you refer to: An Optimal Decomposition Algorithm for Tree Edit Distance.
This version also has pseudocode.

Related

What is the time complexity performance of Scala's Vector data structure?

I know that most of the Vector methods are effectively O(1) (constant time) because of the tree they use, but I cannot find any information on the contains method. My first thought is that it would have to be O(n) to check all the elements but I am not sure.
Answering the question in the title: the performance characteristics (2.13 docs version) of the basic operations head, tail, apply, update, prepend, append, and insert are all listed as eC for Vector:
eC The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.
You are correct that contains is O(n), as there is no hashing or anything else that would avoid the need to compare against all items. Still, if you want to be sure, it is best to check the implementation.
As finding the correct implementation in the library sources can be difficult because of many traits and overrides used to implement the containers, the best way to check this is the debugger. Use a code like:
val v = Vector(0, 1, 2)
v.contains(1)
Use the debugger to step into v.contains and the source you will see is:
def contains[A1 >: A](elem: A1): Boolean = exists (_ == elem)
If you are still not convinced at this point, some more "step into" will lead you to:
def exists(p: A => Boolean): Boolean = {
  var res = false
  while (!res && hasNext) res = p(next())
  res
}

Find the longest chaining list?

I have a lot of objects. Some objects can produce a chaining list, and all objects have a chaining continuation on condition. An example will explain better. Let's say I have these two objects:
[
    {
        "name": "ObjectA",
        "produce": "ChainA",
        "continuations": [
            { "ChainB": "ChainC" },
            { "ChainC": "ChainC" }
        ]
    },
    {
        "name": "ObjectB",
        "produce": null,
        "continuations": [
            { "ChainA": "ChainB" }
        ]
    }
]
I need to find the list :
ChainA (ObjectA) => ChainB (ObjectB) => ChainC (ObjectA) => ChainC (ObjectA)
I just can't find a better way than looping over and over again. Does anyone have an idea?
Thanks
This problem is NP-hard, because an efficient algorithm to solve it would also solve the longest path problem (https://en.wikipedia.org/wiki/Longest_path_problem). So an efficient exact solution is pretty hopeless.
Here is the reduction. Given a directed graph G, label the edges as objects and label the vertices as chains. Each "object" (edge) produces the "chain" that is the vertex it goes to. Each "object" (edge) has a single continuation, from the source "chain" (vertex) to the target "chain" (vertex).
Now apply the hypothetical algorithm. The longest chaining list will be the edges of the desired longest path for G.
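For small inputs you can still brute-force it. Here is a minimal Python sketch (mine, and exponential in the worst case, as the reduction predicts) that flattens each continuation into an edge and does an exhaustive DFS, assuming each continuation may be used at most once:
objects = [
    {"name": "ObjectA", "produce": "ChainA",
     "continuations": [{"ChainB": "ChainC"}, {"ChainC": "ChainC"}]},
    {"name": "ObjectB", "produce": None,
     "continuations": [{"ChainA": "ChainB"}]},
]

# Flatten every continuation into an edge: (from_chain, to_chain, owner).
edges = []
for obj in objects:
    for cont in obj["continuations"]:
        for src, dst in cont.items():
            edges.append((src, dst, obj["name"]))

def longest_chain(current, used):
    # Exhaustive DFS; each edge may be used at most once.
    best = [current]
    for i, (src, dst, owner) in enumerate(edges):
        if src == current and i not in used:
            tail = longest_chain(dst, used | {i})
            if 1 + len(tail) > len(best):
                best = [current] + tail
    return best

for obj in objects:
    if obj["produce"]:
        print(" => ".join(longest_chain(obj["produce"], frozenset())))
# ChainA => ChainB => ChainC => ChainC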

A tic tac toe - that will make one move given an input

Assume the computer is playing as C and the opponent is playing as O. The bot must be intelligent enough to win when given the opportunity.
X represents a cell that is not taken.
Also assume that the computer made the first move. For example:
Input1
CCX
XOX
OXX
Output1
CCC
XOX
OXX
What I want to know is how to approach this problem. Is there a specific algorithm to follow? If so, please explain it to me!
Use the minimax algorithm.
Once that is implemented, define a simple heuristic, or evaluation function. It could go something like this:
function scoreBoard(board) {
    if (board.isWin()) {
        return 1;
    } else if (board.isTie()) {
        return 0;
    } else {
        return -1;
    }
}
What you are looking for is the Minimax Algorithm.
You can find details here: http://en.wikipedia.org/wiki/Minimax
The idea behind the algorithm is to generate all possible states, assuming that the computer is intelligent and always chooses the best move (maximizing its gain); your move should minimize the computer's gain.
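Here is a minimal minimax sketch in Python, using the question's notation ('C' computer, 'O' opponent, 'X' free) on a flattened 9-cell board:
LINES = [(0,1,2), (3,4,5), (6,7,8),   # rows
         (0,3,6), (1,4,7), (2,5,8),   # columns
         (0,4,8), (2,4,6)]            # diagonals

def winner(board):
    for a, b, c in LINES:
        if board[a] != 'X' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    # Returns (score, move) from the computer's point of view.
    w = winner(board)
    if w == 'C':
        return 1, None
    if w == 'O':
        return -1, None
    free = [i for i, cell in enumerate(board) if cell == 'X']
    if not free:
        return 0, None                 # tie
    results = []
    for i in free:
        board[i] = player              # try the move
        score, _ = minimax(board, 'O' if player == 'C' else 'C')
        board[i] = 'X'                 # undo it
        results.append((score, i))
    return max(results) if player == 'C' else min(results)

board = list("CCX" "XOX" "OXX")        # the question's Input1
_, move = minimax(board, 'C')
board[move] = 'C'
print("".join(board[:3]), "".join(board[3:6]), "".join(board[6:]), sep="\n")
# CCC / XOX / OXX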

How to average several columns at once in Scalding?

As the final step on some computations with Scalding I want to compute several averages of the columns in a pipe. But the following code doesn't work
myPipe.groupAll { _average('col1,'col2, 'col3) }
Is there any way to compute several such functions (sum, max, average) without doing several passes? I'm concerned about performance, but maybe Scalding is smart enough to detect this automatically.
This question was answered in the cascading-user forum. Leaving an answer here as a reference
myPipe.groupAll { _.average('col1).average('col2).average('col3) }
You can do size (aka count), average, and standardDev in one go using the function below.
// Find the count of boys vs. girls, their mean age and standard deviation.
// The new pipe contains "sex", "count", "meanAge" and "stdevAge" fields.
val demographics = people.groupBy('sex) { _.sizeAveStdev('age -> ('count, 'meanAge, 'stdevAge) ) }
finding max would require another pass though.

How can I detect common substrings in a list of strings

Given a set of strings, for example:
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
I want to be able to detect that these are three sets of files:
EntireS[1,2]
J27[Red,Green]P[1,2]
JournalP[1,2][Red,Green,Blue]
Are there any known ways of approaching this problem - any published papers I can read on this?
The approach I am considering is, for each string, to look at all the other strings, find the common characters and where the differing characters are, and try to find the sets of strings that have the most in common. But I fear this is not very efficient and may give false positives.
Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.
I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem
There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.
Edited to add:
Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.
I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.
To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.
Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).
In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.
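To make the distance idea concrete, here is a minimal Levenshtein distance in Python that could serve as diff(X, Y); it is symmetric and satisfies the triangle inequality discussed above:
def levenshtein(a, b):
    # Classic O(len(a) * len(b)) dynamic program, kept to two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

names = ["EFgreen", "EFgrey", "EntireS1", "EntireS2"]
for a in names:
    print([levenshtein(a, b) for b in names])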
Great question! The steps to a solution are:
tokenizing input
using tokens to build an appropriate data structure. A DAWG is ideal, but a Trie is simpler and a decent starting point.
optional post-processing of the data structure for simplification or clustering of subtrees into separate outputs
serialization of the data structure to a regular expression or similar syntax
I've implemented this approach in regroup.py. Here's an example:
$ cat | ./regroup.py --cluster-prefix-len=2
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
^D
EFgre(en|y)
EntireS[12]
J27(Green|Red)P[12]
JournalP[12](Bl(ack|ue)|(Green|Red))
Something like this might work.
Build a trie that represents all your strings.
In the example you gave, there would be two edges from the root: "E" and "J". The "J" branch would then split into "Jo" and "J2".
A single strand that forks, e.g. E-n-t-i-r-e-S-(forks to 1, 2) indicates a choice, so that would be EntireS[1,2]
If the strand is "too short" in relation to the fork, e.g. B-A-(forks to N-A-N-A and H-A-M-A-S), we list two words ("banana, bahamas") rather than a choice ("ba[nana,hamas]"). "Too short" might be as simple as "if the part after the fork is longer than the part before", or maybe weighted by the number of words that have a given prefix.
If two subtrees are "sufficiently similar" then they can be merged so that instead of a tree, you now have a general graph. For example if you have ABRed,ABBlue,ABGreen,CDRed,CDBlue,CDGreen, you may find that the subtree rooted at "AB" is the same as the subtree rooted at "CD", so you'd merge them. In your output this will look like this: [left branch, right branch][subtree], so: [AB,CD][Red,Blue,Green]. How to deal with subtrees that are close but not exactly the same? There's probably no absolute answer but someone here may have a good idea.
I'm marking this answer community wiki. Please feel free to extend it so that, together, we may have a reasonable answer to the question.
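As one small extension, here is a minimal Python sketch of the trie idea: insert every string character by character, collapse single-child chains into the prefix, and print the forks. The "too short" heuristic and subtree merging from above are left out:
from collections import defaultdict

def make_trie():
    return defaultdict(make_trie)

def insert(trie, word):
    node = trie
    for ch in word:
        node = node[ch]
    node['$'] = {}                     # end-of-word marker

def show(trie, prefix="", depth=0):
    # Collapse single-child chains into the prefix, then print forks.
    while len(trie) == 1 and '$' not in trie:
        ch, trie = next(iter(trie.items()))
        prefix += ch
    if '$' in trie and len(trie) == 1:
        print("  " * depth + prefix)   # a complete string, no fork below
        return
    print("  " * depth + prefix + "-") # a fork (the root prints as a bare "-")
    for ch in sorted(k for k in trie if k != '$'):
        show(trie[ch], ch, depth + 1)

trie = make_trie()
for w in ["EntireS1", "EntireS2", "J27RedP1", "J27GreenP1"]:
    insert(trie, w)
show(trie)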
try "frak" . It creates regex expression from set of strings. Maybe some modification of it will help you.
https://github.com/noprompt/frak
Hope it helps.
There are many approaches to string similarity. I would suggest taking a look at this open-source library, which implements a lot of metrics like Levenshtein distance.
http://sourceforge.net/projects/simmetrics/
You should be able to achieve this with generalized suffix trees: look for long paths in the suffix tree which come from multiple source strings.
There are many solutions proposed that solve the general case of finding common substrings. However, the problem here is more specialized. You're looking for common prefixes, not just substrings. This makes it a little simpler.
A nice explanation for finding longest common prefix can be found at
http://www.geeksforgeeks.org/longest-common-prefix-set-1-word-by-word-matching/
So my proposed "pythonese" pseudo-code is something like this (a simple find_lcp stand-in is included; the link shows a word-by-word implementation):
def find_lcp(a, b):
    # Simple stand-in for the longest-common-prefix helper; the link
    # above shows a word-by-word implementation.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def count_groups(items):
    sorted_list = sorted(items)
    prefix = sorted_list[0]
    ctr = 0
    groups = {}
    saved_common_prefix = ""
    for i in range(1, len(sorted_list)):
        common_prefix = find_lcp(prefix, sorted_list[i])
        if len(common_prefix) > 0:  # we are still in the same group of items
            ctr += 1
            saved_common_prefix = common_prefix
            prefix = common_prefix
        else:  # we must have switched to a new group
            groups[saved_common_prefix] = ctr
            ctr = 0
            saved_common_prefix = ""
            prefix = sorted_list[i]
    if saved_common_prefix:  # don't forget to flush the final group
        groups[saved_common_prefix] = ctr
    return groups
For this particular set of example strings, to keep it extremely simple, consider using simple word/digit separation.
A non-digit sequence apparently begins with a capital letter (Entire). After breaking all strings into groups of sequences, you get something like
[Entire][S][1]
[Entire][S][2]
[J][27][Red][P][1]
[J][27][Green][P][1]
[J][27][Red][P][2]
....
[Journal][P][1][Blue]
[Journal][P][1][Green]
Then start grouping group by group; you can fairly soon see that the prefix "Entire" is common to one group, that all of its subgroups have S as a head group, and that the only variable there is 1,2.
For the J27 case you can see that J27 is the common stem, but that it then branches at Red and Green.
So you end up with some kind of List<Pair<List, String>> structure (the composite pattern, if I recall correctly).
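A minimal sketch of that word/digit separation in Python (the exact regex is my own guess):
import re

def tokenize(s):
    # Digit runs, capitalized words, or leftover lowercase runs.
    return re.findall(r'\d+|[A-Z][a-z]*|[a-z]+', s)

for s in ["EntireS1", "J27RedP1", "JournalP1Blue"]:
    print(tokenize(s))
# ['Entire', 'S', '1'] / ['J', '27', 'Red', 'P', '1'] / ['Journal', 'P', '1', 'Blue']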
import java.util.*;

class StringProblem
{
    public List<String> subString(String name)
    {
        // Collect every substring of name.
        List<String> list = new ArrayList<String>();
        for (int i = 0; i < name.length(); i++)
        {
            for (int j = i + 1; j <= name.length(); j++)
            {
                list.add(name.substring(i, j));
            }
        }
        return list;
    }

    public String commonString(List<String> list1, List<String> list2, List<String> list3)
    {
        // Keep only the substrings present in all three lists,
        // then return the longest of them.
        list2.retainAll(list1);
        list3.retainAll(list2);
        String longest = "";
        for (String s : list3)
        {
            if (s.length() > longest.length())
            {
                longest = s;
            }
        }
        return longest;
    }

    public static void main(String args[])
    {
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter the String1:");
        String name1 = sc.nextLine();
        System.out.println("Enter the String2:");
        String name2 = sc.nextLine();
        System.out.println("Enter the String3:");
        String name3 = sc.nextLine();
        // String name1 = "salman";
        // String name2 = "manmohan";
        // String name3 = "rahman";
        StringProblem sp = new StringProblem();
        List<String> list1 = sp.subString(name1);
        List<String> list2 = sp.subString(name2);
        List<String> list3 = sp.subString(name3);
        System.out.println(" " + sp.commonString(list1, list2, list3));
    }
}
