How does Biopython determine the root of a phylogenetic tree?

There are other packages, particularly ape for R, that build an unrooted tree then allow you to root it by explicitly specifying an outgroup.
In contrast, in Biopython I can create a rooted tree directly without specifying the root, so I'm wondering how the root is determined, for example in the following code.
from Bio import AlignIO
alignment = AlignIO.read('mulscle-msa-aligned-105628a58654.fasta', 'fasta')
from Bio.Phylo.TreeConstruction import DistanceCalculator
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
from Bio import Phylo
Phylo.write(tree, 'phyloxml-7016bed7d42.xml', 'phyloxml')
I made up the sequences here after the tree was built, but nonetheless this is a rooted tree built from that process.

As @cel said, this is a product of the UPGMA algorithm. UPGMA creates a tree by working backward from the present (or whenever your data are from). It starts by finding the two most similar species. In theory, these species have a more recent common ancestor than any other pair of species, so they are grouped together. The similarity of their common ancestor to other species in the tree is loosely estimated by averaging each species' similarity to all members of the group.
This process continues, grouping the two most similar species (or presumed common ancestors) in the tree at each step and then recalculating similarities, until there are only two groups left. One of these groups may have only one member, in which case it can effectively be thought of as the outgroup, but they may also both have many members. The root of the tree is the common ancestor of these two groups.
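The grouping-and-averaging loop described above can be sketched as follows. This is a minimal toy implementation, not Biopython's; the label concatenation used to name merged clusters is purely for illustration:

```python
# A toy UPGMA loop: repeatedly merge the two closest clusters, averaging
# distances (weighted by cluster size), until one cluster -- the root -- remains.

def upgma_merge_order(labels, dist):
    """labels: list of taxon names; dist: dict[(a, b)] -> distance, with a < b."""
    clusters = {name: [name] for name in labels}
    d = dict(dist)

    def get(a, b):
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda xy: get(*xy),
        )
        new = a + b  # illustrative name for the merged cluster
        merges.append((a, b, get(a, b)))
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two members' distances.
        for c in clusters:
            if c not in (a, b):
                na, nb = len(clusters[a]), len(clusters[b])
                d[(new, c)] = (get(a, c) * na + get(b, c) * nb) / (na + nb)
        clusters[new] = clusters.pop(a) + clusters.pop(b)
    return merges

order = upgma_merge_order(
    ["A", "B", "C"],
    {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0},
)
# The final merge joins the last two groups; their join point is the root.
```

Here A and B are merged first (distance 2.0), and the last merge, joining cluster AB with C, is exactly the root described above.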

Related

Data structure: a graph that's similar to a tree - but not a tree

I have implemented a data structure in C, based upon a series of linked lists, that appears to be similar to a tree - but not similar enough to be referred to as such, because in theory it allows the existence of cycles. Here's a basic outline of the nodes:
There is a single, identifiable root that doesn't have a parent node or brothers;
Each node contains a pointer to its "father", its nearest "brother" and the first of its "children";
There are "outer" nodes without children and brothers.
How can I name such a data structure? It cannot be a tree, because even if the pointers are clearly labelled and used differently, cycles like father->child->brother->father may very well exist. My question is: can terms such as "father", "children" and "brother" be used in the context of a graph, or are they reserved for trees? After quite a bit of research I'm still unable to clarify this matter.
Thanks in advance!
I'd say you can still call it a tree, because the foundation is a tree data structure. There is precedent for my claim: "Patricia Tries" are referred to as trees even though their leaf nodes may point up the tree (creating cycles). I'm sure there are other examples as well.
It sounds like the additional links you have are essentially just for convenience, and could be determined implicitly (rather than stored explicitly). By storing them explicitly you impose additional invariants on your tree operations (insert, delete, etc), but do not affect the underlying organization, which is a tree.
Precisely because you are naming and treating those additional links separately, they can be thought of as an "overlay" on top of your tree.
In the end it doesn't really matter what you call it or what category it falls into if it works for you. Reading a book like Sedgewick's "Algorithms in C" you realize that there are tons of data structures out there, and nothing wrong with inventing your own!
One more point: Trees are a special case of graphs, so there is nothing wrong with referring to it as a graph (or using graph algorithms on it) as well.
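The node layout from the question can be sketched like this (field names are hypothetical). Following father -> child -> brother -> father does revisit a node, which is exactly the "cycle" being asked about, even though the underlying organization is still a tree:

```python
# Each node stores its father, its nearest brother (next sibling),
# and the first of its children -- the layout described in the question.
class Node:
    def __init__(self, name):
        self.name = name
        self.father = None
        self.brother = None  # next sibling in the sibling list
        self.child = None    # first child

def add_child(parent, node):
    node.father = parent
    node.brother = parent.child  # push onto the front of the sibling list
    parent.child = node

root = Node("root")
a, b = Node("a"), Node("b")
add_child(root, a)
add_child(root, b)

# father -> child -> brother -> father revisits the root: a cycle in the
# directed pointer graph, though the organization is still a tree.
assert root.child.brother.father is root
```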

trees for family tree creation

I'm looking for a good algorithm to work out the lowest common ancestor for two people, without an initial overall tree.
If I have a child, I can make a call to get the parent. So if I have two children, I would need to make calls on each until they have a common ancestor. By this point I should have two lists, which I can build into a tree.
What kind of algorithm would be good for this?
If there's some basic information that determines whether a specific node can "in principle" be the parent of another specific node, it would allow you to determine the common ancestor (if any) quickly/efficiently. That's what my question about "age" - see comment - was about.
An age-property obviously would give you that type of information.
Assuming, from your answer, that there's no such information available, there are two obvious approaches:
A. Scan both trees upwards (getParent, etcetera), building the ancestor list for each initial child, meanwhile checking whether every new ancestor occurs in the other child's ancestor list. Generally speaking, there is no reason to assume that the common ancestor is at ("almost") equal distance from the two children, so any order in which you build the respective ancestor lists is equally likely to give you the common ancestor in the fewest possible steps.
A drawback of this method is that you run the risk of having to search in the mutual lists a large number of times for the occurrence of a new candidate. This is potentially slow.
B. An alternative is to simply build each of the two ancestor lists independently until exhaustion. This is probably relatively fast, as you are not interrupted at each step by having to check the contents of "the other" list.
Then, when you've finished the two lists, you either have a common ancestor or not, simply by verifying the top-items against each other.
If no, you're done (no result). If yes, you walk back down the lists synchronously and check where they separate, again avoiding the list-search.
From a general point of view I'd prefer method B. Occasionally method A will of course produce the result faster, but in bad cases it may become tedious/slow, which will not happen with method B. I assume, of course, that no ancestor list can be circular, i.e., that a node cannot be an ancestor of itself.
As a final note: if, after all, you do have a property that "orders" nodes in the sense of "age", then you should build the two lists simultaneously, always processing the list whose current top node cannot possibly be the parent of the other list's top node. This allows you to check only the current top nodes against each other (and not any other members of the lists).
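Method B can be sketched as follows, assuming the getParent-style lookup is a simple dict from child to parent (names here are illustrative):

```python
# Method B: build both ancestor lists to exhaustion, compare the tops,
# then walk back down in lockstep to find where the lists separate.
def common_ancestor(parent, x, y):
    def ancestors(node):
        chain = [node]
        while chain[-1] in parent:       # climb until there is no parent
            chain.append(parent[chain[-1]])
        return chain                     # child first, root last

    ax, ay = ancestors(x), ancestors(y)
    if ax[-1] != ay[-1]:
        return None                      # different roots: no common ancestor
    lca = ax[-1]
    # Walk down from the shared root while the two lists still agree.
    for u, v in zip(reversed(ax), reversed(ay)):
        if u != v:
            break
        lca = u
    return lca

parent = {"c1": "p1", "c2": "p2", "p1": "g", "p2": "g"}
assert common_ancestor(parent, "c1", "c2") == "g"
```

Note that no membership test against "the other" list is ever needed: the only comparisons are top-against-top and then the synchronized walk down.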

Suitable tree data structure

I have been reading about tree data structures to model a problem. I need to construct an in-memory representation of data which is very similar to the folder/file representation in a file system (I don't mean the actual files stored on disk, but an explorer-like structure). The tree may be at most 10 levels deep. The intermediate nodes may have only a moderate number of children (say 10), but there could be thousands of leaf nodes [like thousands of files in a folder, where each file is a leaf node].
Some thoughts
A binary tree cannot work, as one node can have at most 2 children (we may have, say, 3 subfolders).
A very generic tree implementation may be inefficient, as my data can be ordered: the left sibling is smaller/lesser than the right ones. I hope this allows efficient traversal.
A B-tree sounds very close, but does it insist on balancing requirements? In my case the depth won't be more than 10, but not every branch is necessarily that deep (say c:/windows, C:/MyDoc../A/B/C).
Please help with your experience. Should I custom-build a tree, or is there a suitable data structure available (I don't mean one specific to a programming language)?
You have two different kinds of nodes: files and folders.
A folder node contains a set (or map) of children, where the children may themselves be files or folders.
Alternatively, you might prefer for a folder node to contain a set of files and a set of folders.
For the sets, just use your favorite representation of ordered sets (probably the one that comes with whatever language you are using). Depending on the exact details of your situation, you might prefer to use a map instead.
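A minimal sketch of that two-node-kind design, with a folder node holding a map from name to child (the names and the sorted() traversal are just illustrative stand-ins for an ordered map):

```python
# Two kinds of nodes: File (leaf) and Folder (holds a name -> node map).
class File:
    def __init__(self, name):
        self.name = name

class Folder:
    def __init__(self, name):
        self.name = name
        self.children = {}  # name -> File or Folder

    def add(self, node):
        self.children[node.name] = node

    def walk(self, path=""):
        """Yield every path in the subtree, children in name order."""
        here = path + self.name + "/"
        yield here
        for name in sorted(self.children):  # ordered traversal
            child = self.children[name]
            if isinstance(child, Folder):
                yield from child.walk(here)
            else:
                yield here + child.name

root = Folder("c:")
docs = Folder("docs")
root.add(docs)
docs.add(File("a.txt"))
paths = list(root.walk())
```

An ordered-map implementation (e.g. a balanced tree keyed by name) would give the same interface with ordered iteration for free.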
Use two separate data structures:
A binary search tree for search
And a general binary tree for representation
and link these two together.
Note:
In the general tree, put folders first in order, and put all the files into a BST attached as one last node.
Or Use:
Node:
    Node* Left_Most_Child_Folder;
    Node* Right_Sibling_Folder;
    BST_Node* Files_Root;
In a typical file system, the "directory-tree" and the search tree are not the same thing, and are usually maintained separately. The "directory-tree", which tells you what files/sub-folders a folder has, or the path to a particular file, simply reflects how the user organizes the files and is only useful to the user. The search tree on the other hand maintains the global index of all files, so as to facilitate a fast search.
For example, you can implement a Linux like file system, where a folder is a file that records the pointers of the other files/folders it contains. At the same time you maintain a B+ tree, which has every file pointer as a leaf. The balance condition of the B+ tree has nothing to do with how the user organizes the folders.
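The "two separate structures" idea can be sketched like this: the directory tree holds the user's organization, while a separate global index (here a plain dict standing in for the B+ tree) maps names to nodes for fast lookup. All names are illustrative:

```python
# Directory tree for organization; a separate index for search.
class Entry:
    def __init__(self, name):
        self.name = name
        self.children = []  # sub-entries, in user-defined order

index = {}  # global search index, maintained alongside the tree

def add_entry(parent, name):
    node = Entry(name)
    if parent is not None:
        parent.children.append(node)   # organization: attach under parent
    index[name] = node                 # search: keep the index in sync
    return node

root = add_entry(None, "/")
docs = add_entry(root, "docs")
add_entry(docs, "notes.txt")

# Lookup goes through the index, not a walk of the directory tree:
assert index["notes.txt"].name == "notes.txt"
```

The index's internal balancing (B+ tree, hash table, ...) is entirely independent of how the user nests folders, which is the point being made above.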
One way to do this would be to use a binary tree of binary trees. For example:
Node:
    Node* Children;
    Node* Left;
    Node* Right;
And the root of your tree is a Node*.
This makes for easy traversal and quick insertion and removal of a node. Provided, of course, you know the path to the level where you want to insert the node, or the path to the node that you want to delete. But since you indicate that you want a model similar to Explorer, I assume that finding a particular level doesn't pose a problem.
Searching for a node at a particular level is as simple as searching a binary tree.
Without a little bit more information about what you're trying to model, that's the best I can do.

Clustering tree structured data

Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a Lisp-like S-expression or a (generalized) algebraic data type in Haskell or OCaml.
We are given a large number of "documents" in the tree structure. Our goal is to cluster documents which are similar. By clustering, we mean a way to divide the documents into j groups, such that the elements in each group are similar to one another.
I am sure there are papers out there which describe approaches, but since I am not very well versed in the area of AI/clustering/machine learning, I want to ask somebody who is what to look for and where to dig.
My current approach is something like this:
I want to convert each document into an N-dimensional vector suited for K-means clustering.
To do this, I recursively walk the document tree and calculate a vector at each level. If I am at a tree vertex, I recur on all subvertices and then sum their vectors. Also, whenever I recur, a power factor is applied, so the further down the tree I go, the less it matters. The document's final vector is that of the root of the tree.
Depending on the data at a tree leaf, I apply a function which takes the data into a vector.
But surely, there are better approaches. One weakness of my approach is that it will only similarity-cluster trees whose top structure is much alike. If the similarity is present, but occurs farther down the tree, then my approach probably won't work very well.
I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.
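The recursive embedding described above can be sketched as follows; the decay factor and the toy leaf encoding are assumptions, not part of the question:

```python
# Each node's vector is either its leaf encoding or the decayed sum of its
# children's vectors, so deeper levels contribute less to the final vector.
DECAY = 0.5  # power factor: how quickly deeper levels stop mattering

def leaf_vector(value, dim=4):
    # Toy deterministic leaf encoding: one-hot on a character-sum bucket.
    vec = [0.0] * dim
    vec[sum(map(ord, str(value))) % dim] = 1.0
    return vec

def tree_vector(node, dim=4):
    if not isinstance(node, list):          # leaf: encode the data directly
        return leaf_vector(node, dim)
    total = [0.0] * dim
    for child in node:                      # internal node: decayed child sum
        cv = tree_vector(child, dim)
        total = [t + DECAY * c for t, c in zip(total, cv)]
    return total

doc = ["open", ["read", "close"]]  # tiny tree: root with a leaf and a subtree
vec = tree_vector(doc)
```

The vector for the whole document is the root's vector, and a child two levels down is scaled by DECAY squared, matching the "matters less and less" behavior.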
Distance function
As suggested, one needs to define a distance function between documents. Without such a function, we can't apply a clustering algorithm.
In fact, it may be that the question is about that very distance function and examples thereof. I want documents where elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.
The take-one-step-back viewpoint:
I want to cluster stack traces from programs. These are well-formed tree structures, where the functions close to the root are the inner functions where the failure occurred. I need a decent distance function between stack traces that probably occur because the same event happened in code.
Given the nature of your problem (stack trace), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.
If string matching would indeed be more appropriate for your problem, you can run through your data, map each node onto a hash, and create the n-grams for each 'document'.
Example:
Mapping:
Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3
Doc A: 0-1-2
Doc B: 1-2-3
2-grams for doc A:
X0, 01, 12, 2X
2-grams for doc B:
X1, 12, 23, 3X
Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, the shared 2-gram is 12).
However, if you are still convinced that you need trees, instead of strings, you must consider the following: finding similarities for trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth resulting in a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees for trees that extend it). What you don't want is a data collection containing subtrees that are very rare, or that are present in each document you are processing (which you will get if you do not look for frequent patterns).
Here are some pointers:
http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/
Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
Here you may find a paper that seems closely related to your problem.
From the abstract:
This thesis presents Ixor, a system which collects, stores, and analyzes
stack traces in distributed Java systems. When combined with third-party
clustering software and adaptive cluster filtering, unusual executions can be
identified.

What's a good data structure for building equivalence classes on nodes of a tree?

I'm looking for a good data structure to build equivalence classes on nodes of a tree. In an ideal structure, the following operations should be fast (O(1)/O(n) as appropriate) and easy (no paragraphs of mystery code):
(A) Walk the tree from the root; on each node --> child transition enumerate all the equivalent versions of the child node
(B) Merge two equivalence classes
(C) Create new nodes from a list of existing nodes (the children) and other data
(D) Find any nodes structurally equivalent to node (i.e. they have the same number of children, corresponding children belong to the same equivalence class, and their "other data" is equal) so that new (or newly modified) nodes may be put in the right equivalence class (via a merge)
So far I've considered (some of these could be used in combination):
A parfait, where the children are references to collections of nodes instead of to nodes. (A) is fast, (B) requires walking the tree and updating nodes to point to the merged collection, (C) requires finding the collection containing each child of the new node, (D) requires walking the tree
Maintaining a hash of nodes by their characteristics. This makes (D) much faster but (B) slower (since the hash would have to be updated when equivalence classes were merged)
String the nodes together into a circular linked list. (A) is fast; (B) would be fast but for the fact that "merging" part of a circular list with itself actually splits the list; (C) would be fast; (D) would require walking the tree
Like above, but with an additional "up" pointer in each node, which could be used to find a canonical member of the circular list.
Am I missing a sweet alternative?
You seem to have two forms of equivalence to deal with. Plain equivalence (A), tracked as equivalence classes which are kept up to date and structural equivalence (D), for which you occasionally go build a single equivalence class and then throw it away.
It sounds to me like the problem would be conceptually simpler if you maintain equivalence classes for both plain and structural equivalence. If that introduces too much churn for the structural equivalence, you could maintain equivalence classes for some aspects of structural equivalence. Then you could find a balance where you can afford the maintenance of those equivalence classes but still greatly reduce the number of nodes to examine when building a list of structurally equivalent nodes.
I don't think any one structure is going to solve your problems, but you might take a look at the Disjoint-set data structure. An equivalence class, after all, is the same thing as a partitioning of a set. It should be able to handle some of those operations speedily.
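A minimal disjoint-set (union-find) sketch, with path compression, covering the fast "merge two equivalence classes" operation (B):

```python
# Disjoint-set: each element points toward a representative; union links
# representatives, find follows (and compresses) the chain.
class DisjointSet:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)              # lazily register new elements
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)  # merge two classes

ds = DisjointSet()
ds.union("n1", "n2")
ds.union("n2", "n3")
assert ds.find("n1") == ds.find("n3")  # one merged equivalence class
```

With union by rank added, both operations run in near-constant amortized time, which takes care of (B); the structural-equivalence lookup (D) still needs a hash keyed on (child classes, other data) on top of this.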
Stepping back for a moment, I'd suggest against using a tree at all. The last time I confronted a similar problem, I began with a tree but later moved to an array.
The reasons were multiple, but the number one reason was performance: my classes with up to 100 or so children would actually perform better when manipulated as an array than through the nodes of a tree, mostly because of hardware locality, CPU prefetch logic, and CPU pipelining.
So although algorithmically an array structure requires a larger number of operations than a tree, performing those dozens of operations is likely faster than chasing pointers across memory.
