I have an undirected, unweighted graph. I have a file that contains a list of pairs (connected nodes). Each node may have any number of neighbours. I have to find one or all possible paths from a specified node to another.
I tried depth-first search - it works pretty well, but my friends suggest that it may be pretty slow. What other algorithms would you suggest? Could you please provide example pseudocode for them?
I have hierarchical data from file/folder structures which I use to build trees. I am now trying to compare these trees with random ones and amongst themselves.
To compare against random trees I can preserve the number and labels of the leaf nodes and use traditional tree distance metrics (for instance the Robinson-Foulds distance). However, to compare trees built from different data (with different numbers of leaves and different labels) I have no idea which metric/algorithm to use. Any suggestions?
thanks!
PS: the goal of the comparison would be to establish how similar the topology of these trees is, and to see which clusters may exist (and hence add some evidence about the mechanisms generating the folder structures).
I've learned how these algorithms work, but what are they used for?
Do we use them to:
- find a certain node in a graph, or
- find a shortest path, or
- find a cycle in a graph?
Both of them just visit all the nodes and mark them visited, and I don't see the point of doing that.
I am sort of lost about what I am learning here.
BFS and DFS are graph search algorithms that can be used for a variety of different purposes.
One common application of the two search techniques is to identify all nodes that are reachable from a given starting node. For example, suppose that you have a collection of computers, each of which is networked to a handful of other computers. By running a BFS or DFS from a given node, you will discover all other computers in the network that the original computer is capable of directly or indirectly talking to. These are the computers that come back marked.
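For concreteness, here's a minimal Python sketch of that reachability check (the adjacency-list format and the example network are assumptions for illustration, not part of the original question):

```python
from collections import deque

def reachable_from(graph, start):
    """Return the set of nodes reachable from `start`.

    `graph` is assumed to be a dict mapping each node to a list of its
    neighbours (an adjacency list); that format is just for illustration."""
    seen = {start}
    queue = deque([start])          # using a stack here instead would give DFS
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Example: which computers can machine "A" talk to, directly or indirectly?
network = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": ["E"], "E": ["D"]}
print(reachable_from(network, "A"))   # {'A', 'B', 'C'}
```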
BFS specifically can be used to find the shortest path between two nodes in an unweighted graph. Suppose, for example, that you want to send a packet from one computer in a network to another, and that the computers aren't directly connected to one another. Along what route should you send the packet to get it to arrive at the destination as quickly as possible? If you run a BFS and at each iteration have each node store a pointer to its "parent" node, you will end up finding a route from the start node to each other node in the graph that minimizes the number of links that have to be traversed to reach the destination computer.
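A minimal sketch of that parent-pointer idea (again, the adjacency-list format and the function name are assumptions for illustration):

```python
from collections import deque

def shortest_path_bfs(graph, start, goal):
    """Return a shortest path from start to goal in an unweighted graph,
    or None if the goal is unreachable. `graph` maps node -> list of neighbours."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to reconstruct the route.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None
```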
DFS is often used as a subroutine in more complex algorithms. For example, Tarjan's algorithm for computing strongly-connected components is based on depth-first search. Many optimizing compiler techniques run a DFS over an appropriately-constructed graph in order to determine in which order to apply a specific series of operations. Depth-first search can also be used in maze generation: by taking a grid of nodes and linking each node to its neighbors, you can construct a graph representing a grid. Running a random depth-first search over this graph then produces a maze that has exactly one solution.
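For the maze example, here is a rough Python sketch of the randomized-DFS idea on a simple rectangular grid (the grid representation and names are assumptions for illustration):

```python
import random

def generate_maze(width, height):
    """Carve a maze on a width x height grid using randomized depth-first search.
    Returns the set of 'passages': pairs of adjacent cells whose shared wall has
    been removed. Because the passages form a spanning tree of the grid, every
    pair of cells is connected by exactly one path."""
    def neighbours(cell):
        x, y = cell
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height:
                yield (nx, ny)

    start = (0, 0)
    visited = {start}
    passages = set()
    stack = [start]
    while stack:
        cell = stack[-1]
        unvisited = [n for n in neighbours(cell) if n not in visited]
        if unvisited:
            nxt = random.choice(unvisited)    # the random choice is what varies the maze
            passages.add(frozenset((cell, nxt)))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()                       # dead end: backtrack
    return passages
```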
This is by no means an exhaustive list. These algorithms have all sorts of applications, and as you start to explore more advanced algorithms you will often find yourself relying on DFS and BFS as building blocks. It's similar to sorting - sorting by itself isn't all that interesting, but being able to sort a list of values is enormously useful as a subroutine in more complex algorithms.
Hope this helps!
I've been searching for a while now and can't seem to find an alternative solution. I need a tree traversal algorithm in which a node can have more than one parent, if that's possible (I found a great article here: Storing Hierarchical Data in a Database). Are there any algorithms so that, starting from a root node, we can determine the sequence and dependencies of nodes (I'm currently reading about topological sorting)?
The structure you described isn't a tree, it's a directed graph. Because it lends itself to hierarchical drawing, you might be tempted to think of it as a tree (which is itself an acyclic connected graph).
Typical traversal algorithms for graphs are depth-first and breadth-first search. The graph versions differ only in that they record the nodes they have already visited, in order to avoid visiting the same nodes multiple times. However, if your data structure guarantees that it's acyclic, you can use tree algorithms on your graph by simply treating "parents" as "children".
I made a simple sketch to illustrate what I mean (the perfect chance to try Google Docs' new drawing feature):
As you see, it's possible to treat any graph that has an acyclic directed form as a tree and apply tree algorithms on it. As soon as you can't guarantee this property you'll have to go for dedicated graph algorithms.
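To make the "record which nodes you've already visited" point concrete, here's a rough Python sketch; the node objects with `name` and `children` attributes are assumptions for illustration, not your actual data structure:

```python
def traverse(node, visited=None):
    """Depth-first traversal of an acyclic graph in which a node may have
    several parents. The visited set is what distinguishes this from a plain
    tree traversal: it stops a shared child from being processed once per parent."""
    if visited is None:
        visited = set()
    if id(node) in visited:
        return
    visited.add(id(node))
    print(node.name)              # "process" the node; a `name` field is assumed
    for child in node.children:   # `children` is assumed to be a list of nodes
        traverse(child, visited)
```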
A tree is basically a directed unweighted graph where each vertex has N or fewer edges and no cycles can occur.
If you're certain there are no cycles in your tree, you could just treat a parent as another child of the specified node and perform a preorder traversal normally.
However, if cycles might happen, you need graph algorithms.
Specifically: breadth-first search.
Just checking for a possibly simple case: can the two parents themselves have different parents?
If not, you could turn them into a single node (conceptually) and have a tree again.
Otherwise you will have to split the child node and duplicate a branch for the other parent.
(This can of course lead to inconsistency and/or inefficient algorithms later, depending on whether you need to maintain the data structure.)
The above options hold if you insist on having a tree structure, in which a node by definition can have only one parent.
So maybe you need to step back and explain what you are trying to accomplish and why it must be a tree structure if nodes can have two parents.
You aren't describing a tree here. You can NOT call your graph a tree.
A tree is a connected undirected graph without cycles. The parent/child relationship is NOT an interpretation of directions drawn on the edges. It is the result of naming one vertex the root.
We call a vertex the "parent" of the current one because it is the next vertex on the path to the root. All other vertices adjacent to the current one are its "children".
You can't just lay out an arbitrary graph in such a way that "parents" are "above" or "point to the vertex", and children are "below" or "the vertex points to them". A tree is a tree because a root is picked. What you depict in your question is not a tree. And tree traversal algorithms are NOT applicable to traversing arbitrary graphs.
There are several graph traversal algorithms, such as breadth-first search or depth-first search (check the side notes on those pages for more). Use them instead of trying to force your full-featured graph into your knowledge about trees.
Is there an algorithm or heuristics for graph isomorphism?
Corollary: a graph can be represented by many different drawings.
What's the best approach to finding the different drawings of a graph?
It is a hell of a problem.
In general, the basic idea is to simplify the graph into a canonical form and then compare the canonical forms. Spanning trees are generated with this objective, but spanning trees are not unique, so you need a canonical way to create them.
After you have canonical forms, you can perform the isomorphism comparison (relatively) easily, but that's just the start, since non-isomorphic graphs can have the same spanning tree (e.g. think of a spanning tree T and the addition of a single edge to it to create T'. These two graphs are not isomorphic, but they have the same spanning tree).
Other techniques involve comparing descriptors (e.g. the number of nodes, the number of edges), which can produce false positives in general.
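To illustrate the descriptor idea, a cheap pre-filter might compare a few easy invariants: matching invariants prove nothing (the false positives mentioned above), but mismatching invariants do rule isomorphism out. A rough sketch, assuming graphs stored as adjacency-list dicts:

```python
def invariants(graph):
    """Cheap graph descriptors: node count, edge count, sorted degree sequence.
    `graph` is assumed to map each node to a list of its neighbours."""
    degrees = sorted(len(neigh) for neigh in graph.values())
    edge_count = sum(degrees) // 2            # each undirected edge is counted twice
    return (len(graph), edge_count, tuple(degrees))

def cannot_be_isomorphic(g1, g2):
    """True means definitely not isomorphic; False means 'maybe'
    (a false positive in the sense described above is still possible)."""
    return invariants(g1) != invariants(g2)
```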
I suggest you start with the wiki page about the graph isomorphism problem. I also have a book to suggest: "Graph Theory and Its Applications". It's a tome, but worth every page.
As for your corollary, every possible spatial arrangement of a given graph's vertices is isomorphic to it. So two isomorphic graphs have the same topology and they are, in the end, the same graph from the topological point of view. Another matter is, for example, finding those isomorphic layouts that enjoy particular properties (e.g. with non-crossing edges, if any exist), and that depends on the properties you want.
One of the best algorithms out there for finding graph isomorphisms is VF2.
I've written a high-level overview of VF2 as applied to chemistry - where it is used extensively. The post touches on the differences between VF2 and Ullmann. There is also a from-scratch implementation of VF2 written in Java that might be helpful.
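If you just want to use VF2 from Python rather than implement it, the networkx library ships a VF2-based matcher; a minimal sketch (the example graphs are my own, not from the post above):

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Two drawings of the same 4-cycle, with different node names.
g1 = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1)])
g2 = nx.Graph([("a", "c"), ("c", "b"), ("b", "d"), ("d", "a")])

matcher = isomorphism.GraphMatcher(g1, g2)   # VF2-based matcher
print(matcher.is_isomorphic())               # True
print(matcher.mapping)                       # one concrete node-to-node mapping
```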
A very similar problem - graph automorphism - can be solved by saucy, whose source code is available. It finds all the symmetries of a graph. If you have two graphs, join them into one, and any isomorphism can be discovered as an automorphism of the joined graph.
Disclaimer: I am one of co-authors of saucy.
There are algorithms to do this -- however, I have not had cause to seriously investigate them as of yet. I believe Donald Knuth either is writing or has written on this subject during his second pass at (re)writing The Art of Computer Programming.
As for a simple way to do something that might work in practice on small graphs, I would recommend counting degrees, and then, for each vertex, also noting the set of degrees of its adjacent vertices. This gives you a set of potential matches for each vertex. Then just try all of those (via brute force, but choosing the vertices in increasing order of the size of their potential-match sets) from this restricted set. Intuitively, most graph isomorphisms can be computed practically this way, though clearly there will be degenerate cases that might take a long time.
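A rough Python sketch of that heuristic, assuming graphs stored as adjacency-list dicts; this is the brute force with degree-based pruning described above, not a robust general-purpose isomorphism test:

```python
def signature(graph, v):
    """Degree of v plus the multiset of its neighbours' degrees."""
    return (len(graph[v]), tuple(sorted(len(graph[n]) for n in graph[v])))

def find_isomorphism(g1, g2):
    """Try to find a vertex bijection g1 -> g2 that preserves adjacency.
    Candidate sets are restricted by the degree signatures, then searched
    by backtracking, smallest candidate sets first."""
    if len(g1) != len(g2):
        return None
    candidates = {
        v: [w for w in g2 if signature(g2, w) == signature(g1, v)]
        for v in g1
    }
    if any(not c for c in candidates.values()):
        return None
    order = sorted(g1, key=lambda v: len(candidates[v]))   # small sets first

    def backtrack(i, mapping, used):
        if i == len(order):
            return dict(mapping)
        v = order[i]
        for w in candidates[v]:
            if w in used:
                continue
            # Adjacency must agree with every vertex mapped so far.
            if all((u in g1[v]) == (mapping[u] in g2[w]) for u in mapping):
                mapping[v] = w
                used.add(w)
                result = backtrack(i + 1, mapping, used)
                if result:
                    return result
                del mapping[v]
                used.remove(w)
        return None

    return backtrack(0, {}, set())
```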
I recently came across the following paper: http://arxiv.org/abs/0711.2010
This paper proposes "A Polynomial Time Algorithm for Graph Isomorphism"
My project - Griso - is on sf.net: http://sourceforge.net/projects/griso/ with this description:
Griso is a graph isomorphism testing utility written in C++. It is based on my own POLYNOMIAL-TIME algorithm (this is the salt of the project). See Griso's sample input/output on the http://funkybee.narod.ru/graphs.htm page.
nauty and Traces
nauty and Traces are programs for computing automorphism groups of graphs and digraphs. They can also produce a canonical label. They are written in a portable subset of C, and run on a considerable number of different systems.
The AutGroupGraph command in GAP's GRAPE package.
bliss: another symmetry and canonical labeling program.
conauto: a graph isomorphism package.
As for heuristics: I've been fantasising about a modified Ullmann's algorithm, where you don't only use breadth-first search but mix it with depth-first search, in the sense that first you use breadth-first search intensively, then you set a limit on the breadth analysis and go deeper after checking a few neighbours, lowering the breadth by some amount at each step. This is practically how I find my way on a map: first I locate myself with breadth-first search, then I search for the route with depth-first search - roughly, and it's the best thing my brain has ever come up with. :) In the long term some intelligence could be added to increase the breadth-first neighbour count at critical vertices - for example where there are a large number of neighbouring vertices with the same edge count. Like occasionally double-checking your actual route in the car (without a GPS).
I've found out that the algorithm belongs to the category of k-dimensional Weisfeiler-Lehman algorithms, and it fails on regular graphs. For more, see:
http://dabacon.org/pontiff/?p=4148
Original post follows:
I've worked on the problem of finding isomorphic graphs in a database of graphs (containing chemical compounds).
In brief, the algorithm creates a hash of a graph using the power iteration method. There might be false-positive hash collisions, but the probability of that is exceedingly small (I didn't have any such collisions with tens of thousands of graphs).
The way the algorithm works is this:
Do N iterations (where N is the radius of the graph). On each iteration, and for each node:
- sort the hashes (from the previous step) of the node's neighbors,
- hash the concatenated sorted hashes,
- replace the node's hash with the newly computed hash.
On the first step, a node's hash is affected by its direct neighbors. On the second step, a node's hash is affected by the neighborhood 2 hops away from it. On the Nth step, a node's hash will be affected by the neighborhood N hops around it. So you only need to keep running the Powerhash for N = graph_radius steps. In the end, the graph's center node's hash will have been affected by the whole graph.
To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to determine whether two graphs are isomorphic. If you have labels, then add them (on the first step) to the internal hashes that you calculate for each node.
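Here is a rough, simplified Python sketch of that iteration (not the actual code from the repository linked below, just an illustration of the refinement step, assuming adjacency-list dicts and SHA-256):

```python
import hashlib

def _h(text):
    return hashlib.sha256(text.encode()).hexdigest()

def graph_hash(graph, labels=None, iterations=None):
    """Iteratively re-hash every node from its neighbours' hashes, then
    combine all node hashes into one graph hash.
    `graph` maps node -> list of neighbours; `labels` optionally maps
    node -> initial label (mixed in on the first step, as described above)."""
    if iterations is None:
        iterations = len(graph)   # safe upper bound; the post uses the graph radius
    hashes = {v: _h(str(labels.get(v, "")) if labels else "") for v in graph}
    for _ in range(iterations):
        hashes = {
            v: _h("".join(sorted(hashes[n] for n in graph[v])))
            for v in graph
        }
    # Final hash: sort the node hashes and hash their concatenation.
    return _h("".join(sorted(hashes.values())))
```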
There is more background here:
https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF
You can find the source code of it here:
https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py