Unordered Tree Pattern Matching Algorithm

I am trying to find a reasonable algorithm to find the first tree pattern match in unordered, rooted trees. According to some research I have come across, this problem is NP-complete. I don't need to find every pattern match, I just need to find any pattern match that exists. Preferably, I would rather not have to perform "deletions" on my tree (nor do I want to make a copy to delete nodes from).
Another thing to note is that the tree will be updated between tree matching queries, so I'm also hoping that there may be some algorithms that take advantage of this fact, possibly using an online approach that keeps track of previous partial matches in the tree to optimize a future match.
Is there a straightforward algorithm that can solve this problem given the criteria I mentioned, but one that is still better than the pure naive brute force approach?
Note: my problem is similar to this previously asked question, but that question is specific to ordered trees.

According to http://www.sciencedirect.com/science/article/pii/S1570866704000644 the problem that is NP-complete is tree inclusion. That means the pattern can match while potentially skipping generations. So, for instance, a tree with one root and 1000 leaves could fit into a tree that branches in 2, 10 levels deep. And because this problem is NP-complete, you cannot do fundamentally better than exponential growth as the trees grow.
But you can reduce that exponent and do much better than brute force. For example, for each node in the tree record the maximum depth below it and the total number of descendants. As you try to fit one tree into the other, stop searching whenever you're trying to fit a subtree with too much depth or too many descendants. This will let you avoid following a lot of lost causes.
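A minimal sketch of that precomputation (Python; the Node class, its field names, and the post-order pass are my own illustration, not part of the question):

```python
class Node:
    """Hypothetical node type: a label plus an unordered list of children."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.depth_below = 0   # longest downward path from this node
        self.size = 1          # number of nodes in this subtree

def annotate(node):
    """Post-order pass recording max depth below and subtree size."""
    node.depth_below = 0
    node.size = 1
    for child in node.children:
        annotate(child)
        node.depth_below = max(node.depth_below, child.depth_below + 1)
        node.size += child.size
    return node

def compatible(pattern, target):
    """Cheap rejection test: the pattern subtree can't possibly fit here."""
    return (pattern.depth_below <= target.depth_below and
            pattern.size <= target.size)
```

Before recursing into a candidate target node, check compatible(pattern, target) and skip the whole branch when it fails.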
You can also use dynamic programming to help. What you try to do is store, for each pair of nodes from the two trees, whether or not the subtree below one can be mapped to the other. When you're looking at whether a can go to b, what you first do is check which children of b each child of a can go to. If any child of a can't go anywhere, then you know that the answer is no. If all can go, then sort the children of a from fitting in the fewest to the most places. Now do a brute-force search for how to fit the one into the other. You'll tend to find your dead ends very quickly with this way of organizing the search.
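One possible reading of that scheme as code, reusing the Node class from the sketch above (again my own illustration: I'm assuming a node "maps to" another when its label matches and its children can be paired with distinct children of the target):

```python
def fits(a, b, memo=None):
    """Can pattern subtree `a` be mapped onto target subtree `b`, pairing
    each child of `a` with a distinct child of `b`? (Assumes node labels
    must match; the memo is keyed on node identity.)"""
    if memo is None:
        memo = {}
    key = (id(a), id(b))
    if key in memo:
        return memo[key]
    ok = False
    if a.label == b.label and len(a.children) <= len(b.children):
        # For each child of `a`, the children of `b` it could map onto.
        options = [[tb for tb in b.children if fits(ca, tb, memo)]
                   for ca in a.children]
        if all(options):
            # Handle the most constrained pattern children first, so
            # dead ends are discovered as early as possible.
            order = sorted(range(len(options)), key=lambda i: len(options[i]))
            used = set()

            def assign(k):
                if k == len(order):
                    return True
                for tb in options[order[k]]:
                    if id(tb) not in used:
                        used.add(id(tb))
                        if assign(k + 1):
                            return True
                        used.remove(id(tb))
                return False

            ok = assign(0)
    memo[key] = ok
    return ok
```

To answer the original question you would call fits(pattern_root, t) for each node t of the target tree and stop at the first success. The memo table is also a natural place to reuse work between queries, though entries touching modified subtrees would need to be invalidated after an update.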
However, if the trees are large and the one won't fit into the other, you can spend a very, very long time figuring that fact out.


Difference between 'backtracking' and 'branch and bound'

In backtracking we use both BFS and DFS. Even in branch and bound we use both BFS and DFS, in addition to least-cost search.
So when do we use backtracking and when do we use branch and bound?
Does using branch and bound decrease time complexity?
What is Least cost search in Branch and Bound?
Backtracking
It is used to find all possible solutions available to a problem.
It traverses the state space tree in DFS (Depth First Search) manner.
It realizes that it has made a bad choice & undoes the last choice by backing up.
It searches the state space tree until it has found a solution.
It involves a feasibility function.
Branch-and-Bound
It is used to solve optimization problems.
It may traverse the tree in any manner, DFS or BFS.
It realizes that it already has a better solution than the one the pre-solution leads to, so it abandons that pre-solution.
It completely searches the state space tree to get an optimal solution.
It involves a bounding function.
Backtracking
Backtracking is a general concept to solve discrete constraint satisfaction problems (CSPs). It uses DFS. Once it's at a point where it's clear that the solution cannot be constructed, it goes back to the last point where there was a choice. This way it iterates all potential solutions, maybe aborting sometimes a bit earlier.
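Here is a minimal backtracking sketch, using N-queens as the CSP (my own illustration):

```python
def solve_n_queens(n):
    """Backtracking over a CSP: place one queen per row via DFS,
    undoing the last choice whenever it leads to a dead end."""
    cols = []  # cols[r] = column of the queen placed in row r

    def safe(row, col):
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols))

    def place(row):
        if row == n:
            return list(cols)          # complete, consistent assignment
        for col in range(n):
            if safe(row, col):
                cols.append(col)       # choose
                solution = place(row + 1)
                if solution:
                    return solution
                cols.pop()             # undo the bad choice ("backtrack")
        return None                    # no column works: back up further

    return place(0)

print(solve_n_queens(8))   # prints the first valid placement found
```

The pattern is always the same: choose, recurse, and undo the choice when the recursion reports a dead end.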
Branch-and-Bound
Branch-and-Bound (B&B) is a concept to solve discrete constrained optimization problems (COPs). They are similar to CSPs, but besides having the constraints they have an optimization criterion. In contrast to backtracking, B&B uses Breadth-First Search.
One part of the name, the bound, refers to the way B&B prunes the space of possible solutions: it uses a heuristic to compute a bound, and if a sub-tree cannot improve on the best solution found so far, it can be discarded.
Besides that, I don't see a difference to Backtracking.
Other Sources
There are other answers on the web which make very different statements:
Branch-and-Bound is backtracking with pruning (source)
Backtracking
Backtracking is a general algorithm for finding all (or some) solutions to some computational problems, notably constraint satisfaction problems, that incrementally builds candidates to the solutions, and abandons each partial candidate c ("backtracks") as soon as it determines that c cannot possibly be completed to a valid solution.
It enumerates a set of partial candidates that, in principle, could be completed in various ways to give all the possible solutions to the given problem. The completion is done incrementally, by a sequence of candidate extension steps.
Conceptually, the partial candidates are represented as the nodes of a tree structure, the potential search tree. Each partial candidate is the parent of the candidates that differ from it by a single extension step, the leaves of the tree are the partial candidates that cannot be extended any further.
It traverses this search tree recursively, from the root down, in depth-first order (DFS). It realizes that it has made a bad choice & undoes the last choice by backing up.
For more details: Sanjiv Bhatia's presentation on Backtracking for UMSL.
Branch And Bound
A branch-and-bound algorithm consists of a systematic enumeration of candidate solutions by means of state space search: the set of candidate solutions is thought of as forming a rooted tree with the full set at the root.
The algorithm explores branches of this tree, which represent subsets of the solution set. Before enumerating the candidate solutions of a branch, the branch is checked against upper and lower estimated bounds on the optimal solution, and is discarded if it cannot produce a better solution than the best one found so far by the algorithm.
It may traverse the tree in any of the following manners:
BFS (Breadth First Search) or (FIFO) Branch and Bound
D-Search or (LIFO) Branch and Bound
Least Count Search or (LC) Branch and Bound
For more information: Sanjiv Bhatia's presentation on Branch and Bound for UMSL.
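As a concrete (and deliberately tiny) sketch of the bounding idea, here is a 0/1 knapsack search with a crude optimistic bound; this particular one explores in LIFO (depth-first) order, one of the options listed above. It is my own illustration, not taken from either presentation:

```python
def knapsack_bb(items, capacity):
    """Tiny branch-and-bound sketch for 0/1 knapsack.
    items: list of (value, weight). The bound is deliberately simple:
    current value plus the total value of all remaining items."""
    best = 0

    def remaining_value(i):
        return sum(v for v, _ in items[i:])

    def search(i, value, weight):
        nonlocal best
        if weight > capacity:
            return                       # infeasible branch
        best = max(best, value)
        if i == len(items):
            return
        if value + remaining_value(i) <= best:
            return                       # bound: cannot beat the best found so far
        v, w = items[i]
        search(i + 1, value + v, weight + w)   # branch: take item i
        search(i + 1, value, weight)           # branch: skip item i

    search(0, 0, 0)
    return best

print(knapsack_bb([(60, 10), (100, 20), (120, 30)], 50))  # 220
```

With a tighter bound (for example, the fractional-knapsack relaxation) far more of the tree gets discarded.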
Backtracking:
- an optimal solution is selected from the solution space.
- the tree is traversed via DFS.
Branch and Bound:
- BFS traversal.
- only fruitful solutions are generated rather than generating all possible ones.

If memoization is top-down depth-first, and DP is bottom-up breadth-first, what are the top-down breadth-first / bottom-up depth-first equivalents?

I just read this short post about mental models for Recursive Memoization vs Dynamic Programming, written by professor Krishnamurthi. In it, Krishnamurthi represents memoization's top-down structure as a recursion tree, and DP's bottom-up structure as a DAG where the source vertices are the first – likely smallest – subproblems solved, and the sink vertex is the final computation (essentially the graph is the same as the aforementioned recursive tree, but with all the edges flipped). Fair enough, that makes perfect sense.
Anyways, towards the end he gives a mental exercise to the reader:
Memoization is an optimization of a top-down, depth-first computation
for an answer. DP is an optimization of a bottom-up, breadth-first
computation for an answer.
We should naturally ask, what about
top-down, breadth-first
bottom-up, depth-first
Where do they fit into
the space of techniques for avoiding recomputation by trading off
space for time?
Do we already have names for them? If so, what?, or
Have we been missing one or two important tricks?, or
Is there a reason we don't have names for these?
However, he stops there, without giving his thoughts on these questions.
I'm lost, but here goes:
My interpretation is that a top-down, breadth-first computation would require a separate process for each function call. A bottom-up, depth-first approach would somehow piece together the final solution, as each trace reaches the "sink vertex". The solution would eventually "add up" to the right answer once all calls are made.
How off am I? Does anyone know the answer to his three questions?
Let's analyse what the edges in the two graphs mean. An edge from subproblem a to b represents a relation where a solution of b is used in the computation of a and must be solved before it. (The other way round in the other case.)
Does topological sort come to mind?
One way to do a topological sort is to perform a Depth First Search and on your way out of every node, process it. This is essentially what Recursive memoization does. You go down Depth First from every subproblem until you encounter one that you haven't solved (or a node you haven't visited) and you solve it.
Dynamic Programming, or the bottom-up, breadth-first problem-solving approach, involves solving smaller problems and constructing solutions to larger ones from them. This is the other approach to doing a topological sort, where you visit the node with an in-degree of 0, process it, and then remove it. In DP, the smallest problems are solved first because they have a lower in-degree. (Smaller is subjective to the problem at hand.)
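To make the two orders concrete, here is the usual Fibonacci illustration (my own example, not from the post):

```python
from functools import lru_cache

# Top-down, depth-first: plain recursion plus a memo table.
@lru_cache(maxsize=None)
def fib_memo(n):
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

# Bottom-up: solve the in-degree-0 subproblems first and build upward.
def fib_dp(n):
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

assert fib_memo(30) == fib_dp(30) == 832040
```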
The problem here is the generation of a sequence in which the set of subproblems must be solved; neither top-down breadth-first nor bottom-up depth-first does that naturally.
Top-down breadth-first will still end up doing something very similar to its depth-first counterpart, even if the process is separated into threads. There is an order in which the problems must be solved.
A bottom-up depth-first approach MIGHT be able to partially solve problems, but the end result would still be similar to the breadth-first counterpart. The subproblems would be solved in a similar order.
Given that these approaches offer almost no improvement over the other approaches, do not translate well into analogies, and are tedious to implement, they aren't well established.
@AndyG's comment is pretty much on the point here. I also like @shebang's answer, but here's one that directly answers these questions in this context, not through reduction to another problem.
It's just not clear what a top-down, breadth-first solution would look like. But even if you somehow paused the computation to not do any sub-computations (one could imagine various continuation-based schemes that might enable this), there would be no point to doing so, because there would be sharing of sub-problems.
Likewise, it's unclear that a bottom-up, depth-first solution could solve the problem at all. If you proceed bottom-up but charge all the way up some spine of the computation, but the other sub-problems' solutions aren't already ready and lying in wait, then you'd be computing garbage.
Therefore, top-down, breadth-first offers no benefit, while bottom-up, depth-first doesn't even offer a solution.
Incidentally, a more up-to-date version of the above blog post is now a section in my text (this is the 2014 edition; expect updates).

What's the best way to compare the efficiency of splay trees?

I have implemented several splay tree algorithms.
What's the best way to compare them?
Is it a good start to compare execution time when adding random nodes?
I've also implemented a Binary Search Tree that keeps track of how often each node is visited. I wrote an optimize() method that creates an Optimal Binary Search Tree.
If we do not plan on modifying a search tree, and we know exactly how often each item will be accessed, we can construct an optimal binary search tree, which is a search tree where the average cost of looking up an item (the expected search cost) is minimized.
How can I involve this in the comparison of splay trees?
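For reference, here is a rough sketch of the classic O(n^3) dynamic program that computes that minimum expected cost from access frequencies (the frequency list stands in for the visit counts mentioned above; this is my own sketch, and the resulting cost could serve as a baseline to compare the splay trees against):

```python
def optimal_bst_cost(freq):
    """freq[i] = access frequency of the i-th key in sorted order.
    Returns the minimum total weighted search cost (CLRS-style DP,
    without the O(n^2) Knuth speed-up)."""
    n = len(freq)
    cost = [[0] * n for _ in range(n)]     # cost[i][j]: best cost over keys i..j
    weight = [[0] * n for _ in range(n)]   # weight[i][j]: sum of freq[i..j]
    for i in range(n):
        cost[i][i] = freq[i]
        weight[i][i] = freq[i]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            weight[i][j] = weight[i][j - 1] + freq[j]
            best = float("inf")
            for r in range(i, j + 1):       # try every key as the root
                left = cost[i][r - 1] if r > i else 0
                right = cost[r + 1][j] if r < j else 0
                best = min(best, left + right + weight[i][j])
            cost[i][j] = best
    return cost[0][n - 1]

# Keys accessed 34, 8 and 50 times: the best tree has total weighted cost 142.
print(optimal_bst_cost([34, 8, 50]))
```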
I like the empirical approach.
In this approach:
Create a bunch of random typical data sets, of various lengths.
Run each implementation and find out the execution time for each.
Use hypothesis-testing methods to find out if one implementation is better than the other. Here, the null hypothesis (H0) is "The two implementations should take the same time to execute, on average."
Conclude from step 3 that one implementation is better than the other, with probability 1-p (where p is your p-value).
P.S. The Wilcoxon test is considered a good one, and is used a lot in the literature and in research to compare two algorithms.
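A rough sketch of that procedure (SplayTreeA/SplayTreeB, their insert/search methods, and the workload are placeholders for whatever you implemented):

```python
import random
import time
from scipy.stats import wilcoxon   # paired, non-parametric signed-rank test

def time_workload(tree_factory, keys):
    """Time inserting, then re-searching, a list of keys on a fresh tree."""
    tree = tree_factory()
    start = time.perf_counter()
    for k in keys:
        tree.insert(k)
    for k in keys:
        tree.search(k)
    return time.perf_counter() - start

def compare(factory_a, factory_b, trials=30, n=10_000):
    times_a, times_b = [], []
    for _ in range(trials):
        keys = [random.randrange(n) for _ in range(n)]  # one random data set
        times_a.append(time_workload(factory_a, keys))
        times_b.append(time_workload(factory_b, keys))  # same keys for both
    stat, p = wilcoxon(times_a, times_b)
    return stat, p   # small p: reject H0 "both take the same time on average"

# Usage (hypothetical classes):
# print(compare(SplayTreeA, SplayTreeB))
```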

Backtracking Algorithm

How does weight order affect the computing cost in a backtracking algorithm? The number of nodes and the search trees are the same, but when it's non-ordered it takes more time, so it must be doing something.
Thanks!
Sometimes in backtracking algorithms, when you know a certain branch is not an answer, you can trim it. This is very common with agents for games, and is called Alpha-Beta Pruning.
Thus, when you reorder the visited nodes, you can increase your pruning rate and thus decrease the actual number of nodes you visit, without affecting the correctness of your answer.
One more possibility, if there is no pruning, is cache performance. Sometimes trees are stored as arrays [especially complete trees]. Arrays are most efficient when iterating, not when "jumping randomly". The reorder might change this behavior, resulting in better/worse cache behavior.
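A small self-contained illustration of the first point: the same backtracking search (subset-sum with a suffix-sum bound, my example rather than the asker's actual algorithm) visits fewer nodes when the candidates are ordered so the bound bites earlier:

```python
def subset_sum(weights, target, sort_first=True):
    """Backtracking search for a subset of `weights` summing to `target`,
    pruned with a suffix-sum bound. The visited-node count shows how the
    ordering of candidates changes how early the pruning kicks in."""
    items = sorted(weights, reverse=True) if sort_first else list(weights)
    suffix = [0] * (len(items) + 1)
    for i in range(len(items) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + items[i]      # the most we can still add
    visited = 0

    def search(i, total):
        nonlocal visited
        visited += 1
        if total == target:
            return True
        if i == len(items) or total > target or total + suffix[i] < target:
            return False                           # prune this branch
        return search(i + 1, total + items[i]) or search(i + 1, total)

    return search(0, 0), visited

w = [3, 34, 4, 12, 5, 2]
print(subset_sum(w, 9, sort_first=True))    # same answer as below...
print(subset_sum(w, 9, sort_first=False))   # ...but a different node count
```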
The essence of backtracking is precisely not looking at all possibilities (or nodes, in this case); however, if the nodes are not ordered it is impossible for the algorithm to "prune" a possible branch, because it is not known with certainty whether the element is actually in that branch.
Unlike when it is an ordered tree: if the searched element is greater/smaller than the root of a subtree, the searched element is to the right or left respectively. That is why, if the tree is not ordered, the computational cost is equal to brute force; if the tree is ordered, the worst case is still equivalent to brute force, but the actual running time is smaller.

Red-Black Trees

I've seen binary trees and binary searching mentioned in several books I've read lately, but as I'm still at the beginning of my studies in Computer Science, I've yet to take a class that's really dealt with algorithms and data structures in a serious way.
I've checked around the typical sources (Wikipedia, Google) and most descriptions of the usefulness and implementation of (in particular) Red-Black trees have come off as dense and difficult to understand. I'm sure for someone with the necessary background, it makes perfect sense, but at the moment it reads like a foreign language almost.
So what makes binary trees useful in some of the common tasks you find yourself doing while programming? Beyond that, which trees do you prefer to use (please include a sample implementation) and why?
Red Black trees are good for creating well-balanced trees. The major problem with binary search trees is that you can make them unbalanced very easily. Imagine your first number is a 15. Then all the numbers after that are increasingly smaller than 15. You'll have a tree that is very heavy on the left side and has nothing on the right side.
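A quick way to see that degeneration with a plain, unbalanced BST (toy code of mine, not an RB tree):

```python
class BSTNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Plain, unbalanced BST insertion."""
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

root = None
for k in [15, 14, 13, 12, 11, 10]:   # ever-smaller keys, as described above
    root = insert(root, k)
print(height(root))   # 6: every node hangs off the left, so it is really a list
```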
Red Black trees solve that by forcing your tree to be balanced whenever you insert or delete. It accomplishes this through a series of rotations between ancestor nodes and child nodes. The algorithm is actually pretty straightforward, although it is a bit long. I'd suggest picking up the CLRS (Cormen, Leiserson, Rivest and Stein) textbook, "Introduction to Algorithms", and reading up on RB Trees.
The implementation is also not really so short so it's probably not really best to include it here. Nevertheless, trees are used extensively for high performance apps that need access to lots of data. They provide a very efficient way of finding nodes, with a relatively small overhead of insertion/deletion. Again, I'd suggest looking at CLRS to read up on how they're used.
While BSTs may not be used explicitly, one example of the use of trees in general is in almost every single modern RDBMS. Similarly, your file system is almost certainly represented as some sort of tree structure, and files are likewise indexed that way. Trees power Google. Trees power just about every website on the internet.
I'd like to address only the question "So what makes binary trees useful in some of the common tasks you find yourself doing while programming?"
This is a big topic that many people disagree on. Some say that the algorithms taught in a CS degree such as binary search trees and directed graphs are not used in day-to-day programming and are therefore irrelevant. Others disagree, saying that these algorithms and data structures are the foundation for all of our programming and it is essential to understand them, even if you never have to write one for yourself. This filters into conversations about good interviewing and hiring practices. For example, Steve Yegge has an article on interviewing at Google that addresses this question. Remember this debate; experienced people disagree.
In typical business programming you may not need to create binary trees or even trees very often at all. However, you will use many classes which internally operate using trees. Many of the core organization classes in every language use trees and hashes to store and access data.
If you are involved in more high-performance endeavors or situations that are somewhat outside the norm of business programming, you will find trees to be an immediate friend. As another poster said, trees are core data structures for databases and indexes of all kinds. They are useful in data mining and visualization, advanced graphics (2d and 3d), and a host of other computational problems.
I have used binary trees in the form of BSP (binary space partitioning) trees in 3d graphics. I am currently looking at trees again to sort large amounts of geocoded data and other data for information visualization in Flash/Flex applications. Whenever you are pushing the boundary of the hardware or you want to run on lower hardware specifications, understanding and selecting the best algorithm can make the difference between failure and success.
None of the answers mention what it is exactly BSTs are good for.
If what you want to do is just lookup by values then a hashtable is much faster, O(1) insert and lookup (amortized best case).
A BST will be O(log N) lookup where N is the number of nodes in the tree, inserts are also O(log N).
RB and AVL trees are important, as another answer mentioned, because of this property: if a plain BST is created from values inserted in sorted order, then the tree will be as tall as the number of values inserted, and this is bad for lookup performance.
The difference between RB and AVL trees is in the rotations required to rebalance after an insert or delete: AVL trees are O(log N) for rebalances while RB trees are O(1). An example of the benefit of this constant complexity is in a case where you might be keeping a persistent data source: if you need to track changes to roll back, you would have to track O(log N) possible changes with an AVL tree.
Why would you be willing to pay for the cost of a tree over a hash table? ORDER! Hash tables have no order, BSTs on the other hand are always naturally ordered by virtue of their structure. So if you find yourself throwing a bunch of data in an array or other container and then sorting it later, a BST may be a better solution.
The tree's order property gives you a number of ordered iteration capabilities, in-order, depth-first, breadth-first, pre-order, post-order. These iteration algorithms are useful in different circumstances if you want to look them up.
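For instance, in-order traversal is what hands you the keys already sorted (a tiny sketch with a hand-built tree; my code, not from any particular library):

```python
class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def in_order(node):
    """Yield the keys of a binary search tree in ascending (sorted) order."""
    if node is not None:
        yield from in_order(node.left)
        yield node.key
        yield from in_order(node.right)

#        8
#      /   \
#     3     10
#    / \      \
#   1   6     14
root = TreeNode(8, TreeNode(3, TreeNode(1), TreeNode(6)),
                TreeNode(10, None, TreeNode(14)))
print(list(in_order(root)))   # [1, 3, 6, 8, 10, 14]
```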
Red black trees are used internally in almost every ordered container of language libraries, C++ Set and Map, .NET SortedDictionary, Java TreeSet, etc...
So trees are very useful, and you may use them quite often without even knowing it. You most likely will never need to write one yourself, though I would highly recommend it as an interesting programming exercise.
Red Black Trees and B-trees are used in all sorts of persistent storage; because the trees are balanced, the cost of breadth and depth traversals is kept in check.
Nearly all modern database systems use trees for data storage.
BSTs make the world go round, as said by Michael. If you're looking for a good tree to implement, take a look at AVL trees (Wikipedia). They have a balancing condition, so they are guaranteed to be O(log n). This kind of searching efficiency makes it logical to put into any kind of indexing process. The only thing that would be more efficient would be a hashing function, but those get ugly quick, fast, and in a hurry. Also, you run into the Birthday Paradox (related to the pigeonhole principle).
What textbook are you using? We used Data Structures and Algorithm Analysis in Java by Mark Allen Weiss. I actually have it open in my lap as I'm typing this. It has a great section about Red-Black trees, and even includes the code necessary to implement all the trees it talks about.
Red-black trees stay balanced, so you don't have to traverse deep to get items out. The time saved makes RB trees O(log n) even in the WORST case, whereas unlucky binary trees can get into a lop-sided configuration and cause retrievals in O(n) in a bad case. This does happen in practice (it is unlikely on random data, but easy to hit with sorted input). So if you need time-critical code (database retrievals, network servers, etc.) you use RB trees to support ordered or unordered lists/sets.
But RB trees are for noobs! If you are doing AI and you need to perform a search, you find you fork the state information a lot. You can use a persistent red-black tree to fork new states in O(log(n)). A persistent red-black tree keeps a copy of the tree before and after a morphological operation (insert/delete), but without copying the entire tree (normally an O(log(n)) operation). I have open-sourced a persistent red-black tree for Java:
http://edinburghhacklab.com/2011/07/a-java-implementation-of-persistent-red-black-trees-open-sourced/
The best description of red-black trees I have seen is the one in Cormen, Leiserson and Rivest's 'Introduction to Algorithms'. I could even understand it well enough to partially implement one (insertion only). There are also quite a few applets, such as This One, on various web pages that animate the process and allow you to watch and step through a graphical representation of the algorithm building a tree structure.
Since you ask which tree people use, you need to know that a Red Black tree is fundamentally a 2-3-4 B-tree (i.e. a B-tree of order 4). A B-tree is not equivalent to a binary tree (as asked in your question).
Here's an excellent resource describing the initial abstraction known as the symmetric binary B-tree that later evolved into the RBTree. You would need a good grasp on B-trees before it makes sense. To summarize: a 'red' link on a Red Black tree is a way to represent nodes that are part of a B-tree node (values within a key range), whereas 'black' links are nodes that are connected vertically in a B-tree.
So, here's what you get when you translate the rules of a Red Black tree in terms of a B-tree (I'm using the format Red Black tree rule => B Tree equivalent):
1) A node is either red or black. => A node in a B-tree can either be part of an existing node or start a node on a new level.
2) The root is black. (This rule is sometimes omitted, since it doesn't affect the analysis.) => The root node can be thought of either as part of an internal root node or as a child of an imaginary parent node.
3) All leaves (NIL) are black. (All leaves are the same color as the root.) => Since one way of representing an RB tree is by omitting the leaves, we can rule this out.
4) Both children of every red node are black. => The children of an internal node in a B-tree always lie on another level.
5) Every simple path from a given node to any of its descendant leaves contains the same number of black nodes. => A B-tree is kept balanced because it requires that all leaf nodes be at the same depth (hence the height of a B-tree node corresponds to the number of black links from the root to a leaf of the Red Black tree).
Also, there's a simpler 'non-standard' implementation by Robert Sedgewick here: (He's the author of the book Algorithms along with Wayne)
Lots and lots of heat here, but not much light, so let's see if we can provide some.
First, an RB tree is an associative data structure, unlike, say, an array, which cannot take a key and return an associated value (well, unless that's an integer "key" in a 0% sparse index of contiguous integers). An array cannot grow in size either (yes, I know about realloc() too, but under the covers that requires a new array and then a memcpy()), so if you have either of these requirements, an array won't do. An array's memory efficiency is perfect: zero waste, but not very smart or flexible, realloc() notwithstanding.
Second, in contrast to a bsearch() on an array of elements, which IS an associative data structure, an RB tree can grow (AND shrink) itself in size dynamically. The bsearch() works fine for indexing a data structure of a known size, which will remain that size. So if you don't know the size of your data in advance, or new elements need to be added or deleted, a bsearch() is out. Bsearch() and qsort() are both well supported in classic C, and have good memory efficiency, but are not dynamic enough for many applications. They are my personal favorite though, because they're quick, easy, and if you're not dealing with real-time apps, quite often flexible enough. In addition, in C/C++ you can sort an array of pointers to data records, pointing at the struct member you wish to compare, and then rearrange the pointers in the pointer array such that reading the pointers in order at the end of the pointer sort yields your data in sorted order. Using this with memory-mapped data files is extremely memory efficient, fast, and fairly easy. All you need to do is add a few "*"s to your compare function(s).
Third, in contrast to a hashtable, which also must be a fixed size and cannot be grown once filled, an RB tree will automagically grow itself and balance itself to maintain its O(log(n)) performance guarantee. Especially if the RB tree's key is an int, it can be faster than a hash, because even though a hashtable's complexity is O(1), that 1 can be a very expensive hash calculation. A tree's multiple 1-clock integer compares often outperform 100-clock+ hash calculations, to say nothing of rehashing, and malloc()ing space for hash collisions and rehashes. Finally, if you want ISAM access, as well as keyed access to your data, a hash is ruled out, as there is no ordering of the data inherent in the hashtable, in contrast to the natural ordering of data in any tree implementation. The classic use for a hash table is to provide keyed access to a table of reserved words for a compiler. Its memory efficiency is excellent.
Fourth, and very low on any list, is the linked, or doubly-linked, list, which, in contrast to an array, naturally supports element insertions and deletions, and as that implies, resizing. It's the slowest of all the data structures, as each element only knows how to get to the next element, so you have to search, on average, (element_knt/2) links to find your datum. It is mostly used where insertions and deletions somewhere in the middle of the list are common, and especially where the list is circular and feeds an expensive process which makes the time to read the links relatively small. My general RX is to use an arbitrarily large array instead of a linked list if your only requirement is that it be able to increase in size. If you run out of size with an array, you can realloc() a larger array. The STL does this for you "under the covers" when you use a vector. Crude, but potentially 1,000s of times faster if you don't need insertions, deletions or keyed lookups. Its memory efficiency is poor, especially for doubly-linked lists. In fact, a doubly-linked list, requiring two pointers, is exactly as memory inefficient as a red-black tree while having NONE of its appealing fast, ordered retrieval characteristics.
Fifth, trees support many additional operations on their sorted data than any other data structure. For example, many database queries make use of the fact that a range of leaf values can be easily specified by specifying their common parent, and then focusing subsequent processing on the part of the tree that parent "owns". The potential for multi-threading offered by this approach should be obvious, as only a small region of the tree needs to be locked - namely, only the nodes the parent owns, and the parent itself.
In short, trees are the Cadillac of data structures. You pay a high price in terms of memory used, but you get a completely self-maintaining data structure. This is why, as pointed out in other replies here, transaction databases use trees almost exclusively.
If you would like to see how a Red-Black tree is supposed to look graphically, I have coded an implementation of a Red-Black tree that you can download here
IME, almost no one understands the RB tree algorithm. People can repeat the rules back to you, but they don't understand why those rules and where they come from. I am no exception :-)
For this reason, I prefer the AVL algorithm, because it's easy to comprehend. Once you understand it, you can then code it up from scratch, because it makes sense to you.
Trees can be fast. If you have a million nodes in a balanced binary tree, it takes twenty comparisons on average to find any one item. If you have a million nodes in a linked list, it takes five hundred thousand comparisons on average to find the same item.
If the tree is unbalanced, though, it can be just as slow as a list, and also take more memory to store. Imagine a tree where most nodes have a right child, but no left child; it is a list, but you still have to hold memory space to put in the left node if one shows up.
Anyways, the AVL tree was the first balanced binary tree algorithm, and the Wikipedia article on it is pretty clear. The Wikipedia article on red-black trees is clear as mud, honestly.
Beyond binary trees, B-Trees are trees where each node can have many values. A B-Tree is not a binary tree; it just happens to have a similar name. They're really useful for utilizing memory efficiently; each node of the tree can be sized to fit in one block of memory, so that you're not (slowly) going and finding tons of different things in memory that has been paged to disk. Here's a phenomenal example of the B-Tree.
