Ideal data structure for metabolic pathways - data-structures

So I have a huge list of chemicals within an organism, with data on both their precursor chemicals and the chemicals they produce.
I was thinking that some sort of tree structure would be appropriate; each chemical is a node, each parent is a precursor, each child is a product.
Each node could have more than one parent or more than one child, hence my confusion!
However, the main function of this structure will be to find ALL the chemical pathways that make a given chemical, and I'm not sure if a tree would be the most efficient for this sort of search.
My question is: is there a more appropriate data structure for this type of data and operation?

I think your data structure is a directed graph.
The brute-force approach for finding all the pathways from A to B would be to do a breadth-first search starting in A, and cover as much of the graph as you are allowed to.
This guarantees that the paths you'll find will be ordered in length from shortest to longest.
Whenever you hit B, you should mark all of the nodes in that path as 'leading to B'. This way you can account for convergent pathways without having to walk the graph more than once.
Keep in mind that, unless you constrain it, it is possible for the graph to contain loops. A loop in a pathway leading from A to B presents you with infinitely many pathways, so it's up to you how you'd like to handle these cases.
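As a rough illustration of the brute-force idea, here is a minimal Python sketch. It is a simplified variant: it extends whole paths breadth-first instead of marking nodes as 'leading to B', and it handles loops by refusing to revisit a node already on the current path. The function name all_simple_paths and the tiny pathways graph are made up for the example.

```python
from collections import deque

def all_simple_paths(graph, start, goal):
    """Enumerate every loop-free path from start to goal, shortest first."""
    paths = []
    queue = deque([[start]])          # each queue entry is a partial path
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:       # never revisit a node on this path (breaks loops)
                queue.append(path + [nxt])
    return paths

# Hypothetical data: precursor -> list of products. X and Y form a loop.
pathways = {
    "A": ["X", "Y"],
    "X": ["B", "Y"],
    "Y": ["B", "X"],
}

for p in all_simple_paths(pathways, "A", "B"):
    print(" -> ".join(p))
```

Because the queue is processed in breadth-first order, shorter paths come out before longer ones. The number of simple paths can grow exponentially, so this is only practical on small graphs or with a depth limit.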

Related

Are there any rules to choose the first adjacent vertex to do graph traversal?

Suppose that I have a directed graph like this, and I want to traverse by using Depth-first search method.
[D] <-- [C] <-- [A] --> [B]
I'm going to start out at the vertex A.
The vertex A has two adjacent vertexes B and C.
I wonder, which vertex should I select first to traverse?
It can be A,C,D,B or A,B,C,D which one is correct? Are there any rules?
There are no rules. Either order is equally correct. However, sometimes an instructor will tell you to traverse nodes in a particular order, such as alphabetically; in that case, of course do what your instructor says. But without explicit instruction, you can iterate through adjacent vertices in whatever order you like.
Short answer: No.
As there is no correct or canonical order of the vertices of a graph (in general), there also is no such order for the DFS algorithm.
A graph stored as a data structure in the memory of a computer always has some vertex order, due to the linear memory address space. Depending on the labels/properties you put on your vertices, you could use them to establish an explicit ordering criterion, e.g. their alphabetical or numerical order. In general, this might lead to more deterministic results, but won't benefit the runtime.
Depending on the data structure, its memory layout and the target architecture the algorithm will be executed on, there might be orderings that increase e.g. data locality while traversing the graph, and thus can speed up the execution of the algorithm.
Depending on the problem the graph models, there might be beneficial orderings for special cases. Think of a case where the DFS is used to search for some vertex with a given property and then aborts as soon as a matching vertex is found. If a probability for finding such a vertex could be assigned to each vertex, then traversing the vertex with the highest probability first would clearly be a good idea.
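To make the point concrete, here is a small Python sketch of a DFS on the graph from the question, where the only thing that changes between runs is the order in which adjacent vertices are tried. The function dfs_order and its reverse flag are illustrative names, not part of any library.

```python
def dfs_order(graph, start, reverse=False):
    """Return vertices in DFS preorder, trying neighbors in sorted order."""
    visited = []
    seen = set()

    def visit(v):
        seen.add(v)
        visited.append(v)
        # The sort is the "explicit ordering criterion"; flipping it changes the result.
        for w in sorted(graph.get(v, ()), reverse=reverse):
            if w not in seen:
                visit(w)

    visit(start)
    return visited

# [D] <-- [C] <-- [A] --> [B]
graph = {"A": ["B", "C"], "C": ["D"]}

print(dfs_order(graph, "A"))                # alphabetical neighbor order
print(dfs_order(graph, "A", reverse=True))  # reverse alphabetical neighbor order
```

Both results are valid depth-first traversals; sorting just makes the outcome deterministic.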

Is there a name for this data structure that is kind of "opposite" of a tree?

We all know what a tree is: on the first level of a tree we have a root, and from the root come branches that are trees as well. But how do I name the "opposite" structure: on the i-th level we have a set of "leaf" nodes, and those nodes form groups of 1+ nodes, and a group points to a "trunk" node on i+1th level. If you want a visual example, imagine raindrops flowing down a window and combining as they collide.
A lot of tree data structures are actually constructed from leaf to root, and can be stored to allow for going one or both directions.
I don't think it really has a special name as it's more a convention than a requirement for trees typically to go from root to leaf rather than the other way or both ways. Also there are a number of tree data structures that allow for going both ways.
Every tree is a DAG, a directed acyclic graph, and so is the data structure that you describe. What you describe is also a multitree, a subset of DAGs. Possibly there is a more precise proper subset of multitrees that describes your graph, but I am not aware of it. Hope this helps.

Directed Graph Versus Associative Array

I have been reading up on directed graphs. I have managed to get an abstract graph data type working in my application but I don't find it particularly intuitive and am considering replacing it with an ordinary multi-dimensional array.
My graph is sparse and acyclic. Each vertex is reachable from one particular 'master' vertex. If it were a tree, this master vertex would be the 'root'. If it were a social network, this master vertex would be 'me'.
Although my graph may have hundreds of thousands of vertices it has a finite depth: the greatest distance between any two nodes is 3 edges.
The underlying data representation is an adjacency list. A small example would look like this:
Head | Tails
--------------
1 | 2, 3, 4
2 | 5
3 | 5
4 | 5
5 | 6
If I was using an ordinary multi-dim array instead of my graph data type, it would look something like this:
$me[1][2][5][6]
$me[1][3][5][6]
$me[1][4][5][6]
Now, the main things that I want to be able to do with this graph are:
Navigate it as a hierarchy. I realise that some child vertices will feature in more than one category (e.g. #5), but that is what I want for this particular use case. I can't see any real difference between an array and a graph for this point.
Lay it out as a list (alphabetical, according to vertex name), with no duplicates. I would probably do a DFS, flagging visited vertices as I go, to avoid exploring them more than once. But as far as I can see this is achievable using either the graph or the array, and at the same cost.
Do an 'all paths' analysis for any given pair of points. Because I want 'all paths' (ie. I'm not simply checking for reachability), it seems to me that I have to traverse the entire graph, and again I can see no advantage in a graph over an array.
I get the feeling that I am missing something, but I can't put my finger on it. Can you??? Any ideas, suggestions, insights or advice gratefully accepted... (By the way, I'm using PHP, and the data source is a relational DB. I don't think this makes any real difference though).
Thanks!
One thing you need to understand is that a directed graph (or digraph) is a concept, whereas an associative array is a data structure.
An instance of the digraph concept can be stored in many different data structures, of which you can find the most common on this wikipedia page.
I'm not sure what you are doing with your multidimensional array... storing all paths? You will end up with an N³ space complexity, and trouble building it. A tree-based structure would be more efficient at the very least.
Now to the things you want to do with your graph:
Navigate as a hierarchy. The basic digraph concept doesn't let you go up in the hierarchy, but you can easily store the reverse graph as well (especially with matrix-based representations: just use 3 values instead of 2, for forward, backward and nothing).
Lay it out as a list, according to name. You have to store the name somewhere (either in a side map or in the vertex object), but it shouldn't be any harder than sorting anything else according to name.
Do an 'all paths' analysis. You can probably get away with linear complexity (in the number of paths) through DP and a shared representation of paths.
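As a hedged illustration of the DP idea in the last point, the sketch below counts (rather than lists) the paths between two vertices of an acyclic graph, reusing the result for every shared suffix. It is written in Python rather than PHP for brevity; the adjacency dict mirrors the small example from the question, and count_paths is a made-up name.

```python
from functools import lru_cache

# Adjacency list from the question: head -> tails.
adjacency = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5], 5: [6], 6: []}

def count_paths(adjacency, source, target):
    """Count distinct paths from source to target. Assumes the graph is acyclic."""
    @lru_cache(maxsize=None)
    def paths_from(v):
        if v == target:
            return 1
        # Each vertex is solved once; shared sub-paths (e.g. 5 -> 6) are reused.
        return sum(paths_from(w) for w in adjacency.get(v, ()))

    return paths_from(source)

print(count_paths(adjacency, 1, 6))
```

Each vertex and edge is examined once, so the count costs O(V + E) even when the number of paths is large. Listing the paths themselves necessarily costs at least as much as the size of the output.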
It looks like your data structure is too complicated. If you represent a directed graph as a multidimensional array, it is almost always of dimension two so that
$array[$x][$y]
is a boolean value that is TRUE if and only if there is an edge from node $x to node $y in the graph. In your example it would be e.g.
$array[1][2] = TRUE
$array[1][5] = FALSE
But for sparse graphs, using this boolean matrix representation is not usually good. Typically you would have a one-dimensional array that maps every node to a set of nodes to which there is an edge, e.g.
$array[1] = { 2, 3, 4 }
where { ... } means some sort of an unordered collection data structure, which can be e.g. a binary search tree or a hash set (hash table).
This data structure enables you to quickly find the nodes to which there is an arc from a given node, which is a key feature for graph algorithms.
Sometimes you want to be able to traverse your graph backwards also; in that case you would have another array that maps nodes to the list of their predecessors.
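A minimal sketch of this representation, in Python rather than PHP, with a hash set per node and a second map for predecessors. The function build_adjacency and the variable names are illustrative; the edge list is the small example from the question.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build forward and backward adjacency maps from (source, target) pairs."""
    successors = defaultdict(set)    # node -> nodes it has an edge to
    predecessors = defaultdict(set)  # node -> nodes that have an edge to it
    for source, target in edges:
        successors[source].add(target)
        predecessors[target].add(source)
    return successors, predecessors

edges = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 5), (4, 5), (5, 6)]
successors, predecessors = build_adjacency(edges)

print(sorted(successors[1]))    # nodes reachable from 1 in one step
print(sorted(predecessors[5]))  # nodes with an edge into 5
print(2 in successors[1])       # edge test, average O(1) with a hash set
```

Memory use is proportional to the number of edges, which is what makes this preferable to a boolean matrix for sparse graphs.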

What are good ways of organizing directed graph data?

Here's my situation. I have a graph that has different sets of data being added at different times. For example, set1 might have a few thousand nodes and then set2 comes in later and we apply business logic to create edges from set1 to set2 (and discard any vertices from set1 that do not have edges to set2). Then at a later point, we get set3, set4, and so on, and the same process applies between each set and its previous set.
Question: what's the best way to organize this? What I did before was name the nodes set1-xx, set2-xx, etc. The problem I faced was that when I was trying to run analytics between the current set and the previous set, I would have to loop through the entire graph and look for all the nodes that started with 'setx'. It took a long time as the graph grew, so I thought of another solution, which was to create a node called 'set1' and have it connected to all nodes for that particular set. I am testing it, but I was wondering if there was a more efficient or built-in way of handling data structures like this. Is there a way to somehow segment data like this?
I think a general solution would be applicable, but if it helps I'm using neo4j (so any solution specific to that database would be good as well).
You have a very special type of a directed graph, called a layered graph.
The choice of the data structure depends primarily on the expected graph density (how many nodes from a previous set/layer are typically connected to a node in the current set/layer) and on the operations that you need to perform on it most of the time. It is definitely a good idea to have each layer directly represented by a numeric index (that is, the outermost structure will be an array of sets/layers), and presumably you can also use one array of vertices per layer. However, the list of edges per vertex (out only, or in and out sets of edges depending on whether you ever traverse the layers backward) may be any of the following:
Linked list of vertex identifiers; this is good if the graph is very sparse and edges are often added/removed.
Sorted array of vertex identifiers; this is good if the graph is quite sparse and immutable.
Array of booleans, indexed by vertex identifiers, determining whether a given vertex is or is not linked by an edge from the current vertex; this is good if the graph is dense.
The "vertex identifier" can take many forms. For example, it can be an index into the array of vertices on the next layer.
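One possible in-memory shape for this, sketched in Python under the assumption that edges only go from one layer to the next and that the sparse, sorted-array option is used. The class LayeredGraph and its method names are invented for the illustration and have nothing to do with neo4j's API.

```python
import bisect

class LayeredGraph:
    """Array of layers; each vertex keeps sorted indices into the next layer."""

    def __init__(self):
        self.layers = []     # layers[k] is the list of vertices in layer k
        self.out_edges = []  # out_edges[k][i] is a sorted list of indices into layers[k + 1]

    def add_layer(self, vertices):
        self.layers.append(list(vertices))
        self.out_edges.append([[] for _ in self.layers[-1]])
        return len(self.layers) - 1  # numeric index of the new layer

    def add_edge(self, layer, src_index, dst_index):
        """Connect vertex src_index in `layer` to vertex dst_index in layer + 1."""
        targets = self.out_edges[layer][src_index]
        if dst_index not in targets:
            bisect.insort(targets, dst_index)

    def vertices_with_edges(self, layer):
        """Vertices of `layer` that have at least one edge to the next layer."""
        return [v for v, targets in zip(self.layers[layer], self.out_edges[layer]) if targets]

g = LayeredGraph()
set1 = g.add_layer(["a", "b", "c"])
set2 = g.add_layer(["x", "y"])
g.add_edge(set1, 0, 1)  # a -> y
g.add_edge(set1, 2, 0)  # c -> x
print(g.vertices_with_edges(set1))
```

Analytics between the current set and the previous one then only touch two entries of the outer array instead of scanning the whole graph.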
Your second solution is what I would do- create a setX node and connect all nodes belonging to that set to setX. That way your data is partitioned and it is easier to query.

how to decide whether two persons are connected

Here is the problem:
assuming two persons are registered in a social networking website, how to decide whether they are connected or not?
My analysis (after reading more): the question is really asking for the shortest path from A to B in a graph. I think both BFS and Dijkstra's algorithm work here, and the time complexity is exactly the same (O(V+E)) because it is an unweighted graph, so we can't take advantage of the priority queue. So a simple queue could solve the problem. But neither of them solves the problem of finding the path between them.
Bidirectional search should be a better solution at this point.
To find a path between the two, you should begin with a breadth first search. First find all neighbors of A, then find all neighbors of all neighbors of A, etc. Once B is hit, not only do you have a path from A to B, but you also have a shortest such path.
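A minimal Python sketch of that breadth-first search, which also recovers the path by remembering who discovered each person. The friends dictionary and the function name shortest_path are invented for the example.

```python
from collections import deque

def shortest_path(friends, a, b):
    """Return a shortest path from a to b as a list, or None if not connected."""
    parent = {a: None}   # doubles as the visited set
    queue = deque([a])
    while queue:
        person = queue.popleft()
        if person == b:
            path = []
            while person is not None:   # walk parent links back to a
                path.append(person)
                person = parent[person]
            return path[::-1]
        for friend in friends.get(person, ()):
            if friend not in parent:
                parent[friend] = person
                queue.append(friend)
    return None

# Hypothetical friendship lists (symmetric).
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
    "zed": [],
}

print(shortest_path(friends, "ann", "eve"))
print(shortest_path(friends, "ann", "zed"))
```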
Dijkstra's algorithm rocks, and you may be able to speed this up by working from both ends, i.e. find neighbors of A and neighbors of B, and compare.
If you do a depth first search, then you're following one path at a time. This will be much much slower.
If you do dfs for finding whether two people are connected on a social network, then it will take too long!
You already know the two persons, so you should use bidirectional search. But simple bidirectional search won't be enough for a graph as big as a social networking site. You will have to use some heuristics. The Wikipedia page has some links on this.
You may also be able to use A* search. From wikipedia : "A* uses a best-first search and finds the least-cost path from a given initial node to one goal node (out of one or more possible goals)."
Edit: I suggest A* because "The additional complexity of performing a bidirectional search means that the A* search algorithm is often a better choice if we have a reasonable heuristic." So, if you can't form a reasonable heuristic, then use Bidirectional search. (Forming a good heuristic is never easy ;).)
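For the case where no reasonable heuristic is available, plain bidirectional search could look roughly like the Python sketch below. It assumes friendships are symmetric, always expands the smaller frontier, and only answers yes/no. The friends data and the name are_connected are made up for the example.

```python
def are_connected(friends, a, b):
    """Bidirectional BFS on an undirected graph: True if a and b are connected."""
    if a == b:
        return True
    seen_a, seen_b = {a}, {b}
    frontier_a, frontier_b = {a}, {b}
    while frontier_a and frontier_b:
        # Always grow the smaller side to keep the explored region small.
        if len(frontier_a) > len(frontier_b):
            frontier_a, frontier_b = frontier_b, frontier_a
            seen_a, seen_b = seen_b, seen_a
        next_frontier = set()
        for person in frontier_a:
            for friend in friends.get(person, ()):
                if friend in seen_b:      # the two searches have met
                    return True
                if friend not in seen_a:
                    seen_a.add(friend)
                    next_frontier.add(friend)
        frontier_a = next_frontier
    return False

friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
    "zed": [],
}

print(are_connected(friends, "ann", "eve"))
print(are_connected(friends, "ann", "zed"))
```

Each side only needs to reach about half the distance, which typically means far fewer visited nodes than a one-sided search on graphs where neighborhoods grow quickly.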
One way is to use Union-Find: add all links with union(from, to), and if find(A) == find(B) then A and B are connected. This avoids the recursive search, but it actually computes the connectivity of all pairs and doesn't give you the paths that connect A and B.
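A compact Python sketch of the Union-Find approach, with path compression and union by size. The class layout and the sample links are illustrative assumptions, not a reference implementation.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        """Return the representative of x's set, registering x if it is new."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:   # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        """Merge the sets containing a and b (smaller set under larger)."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

uf = UnionFind()
for frm, to in [("ann", "bob"), ("bob", "dan"), ("eve", "zed")]:
    uf.union(frm, to)

print(uf.find("ann") == uf.find("dan"))
print(uf.find("ann") == uf.find("eve"))
```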
I think that the true criterion is: there are at least N paths between A and B shorter than K, or A and B are connected directly. I would go with K = 3 and N near 5, i.e. have 5 common friends.
Note: answer edited.
Any method might end up being very slow. If you need to do this repeatedly, it's best to find the connected components of the graph, after which the task becomes a trivial O(1) operation: if two people are in the same component, they are connected.
Note that finding connected components for the first time might be slow, but keeping them updated as new edges/nodes are added to the graph is fast.
There are several methods for finding connected components.
One method is to construct the Laplacian of the graph, and look at its eigenvalues / eigenvectors. The number of zero eigenvalues gives you the number of connected components. The non-zero elements of the corresponding eigenvectors gives the nodes belonging to the respective components.
Another way is along the following lines:
Create a transformation table of nodes. Element n of the array contains the index of the node that node n transforms to.
Loop through all edges (i,j) in the graph (denoting a connection between i and j):
Compute recursively which nodes i and j transform to, based on the current table. Let us denote the results by k and l. Update entry k to make it transform to l. Update entries i and j to point to l as well.
Loop through the table again, and update each entry to point directly to the node it recursively transforms to.
Now nodes in the same connected component will have the same entry in the transformation table. So to check if two nodes are connected, just check if they transform to the same value.
Every time a new node or edge is added to the graph, the transformation table needs to be updated, but this update will be much faster than the original calculation of the table.
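The transformation-table procedure above can be sketched in Python as follows, assuming nodes are numbered 0..n-1. The names resolve and build_table are made up, and the lookup is written as a loop rather than a recursion.

```python
def resolve(table, n):
    """Follow the table until reaching a node that transforms to itself."""
    while table[n] != n:
        n = table[n]
    return n

def build_table(num_nodes, edges):
    table = list(range(num_nodes))   # initially every node transforms to itself
    for i, j in edges:
        k = resolve(table, i)
        l = resolve(table, j)
        table[k] = l                 # merge k's component into l's
        table[i] = l                 # shortcut i and j straight to l
        table[j] = l
    for n in range(num_nodes):       # final pass: point every entry at its end node
        table[n] = resolve(table, n)
    return table

table = build_table(6, [(0, 1), (1, 2), (3, 4)])
print(table)
print(table[0] == table[2])  # same component
print(table[0] == table[3])  # different components
```

After the final pass the connectivity check is a single comparison of two table entries. Adding an edge later means repeating the merge step for that edge and re-pointing the affected entries.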

Resources