I want to implement a mutable directed graph in which each node exists at most once.
This is for a board game AI for linking gamestates with each other.
Nodes need to have a mutable part (the NodeData) and an immutable part (the Gamestate). Gamestates are quite large (~80 bytes), so I don't want to store redundant gamestates.
So far, my approach has been using a HashMap for the nodes and a HashSet for the edges.
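use std::collections::{HashMap, HashSet};
use std::rc::Rc;

type Node = (Rc<Gamestate>, NodeData);

struct IterativePSearch {
    root: Gamestate,
    nodes: HashMap<Rc<Gamestate>, NodeData>,
    edges: HashSet<Edge>,
}

// Edge must be hashable to live in a HashSet (Gamestate needs Hash and Eq too).
#[derive(PartialEq, Eq, Hash)]
struct Edge {
    from: Rc<Gamestate>,
    to: Rc<Gamestate>,
}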
Is this a viable approach? Is it minimal in memory, or is there a more efficient way?
My aim here was to store references to Gamestates in Edges, so that I don't have to duplicate the Gamestate. Is this actually happening?
Related
I keep seeing everywhere that there are 3 ways to represent graphs:
Objects and pointers
Adjacency matrix
Adjacency lists
However, I just plain don't understand what these object-and-pointer representations are, yet every recruiter and many blogs cite Steve Yegge's blog as saying that they are indeed a separate representation.
This widely accepted answer to a very similar question seems to suggest that the vertex structures themselves have no internal pointers to other vertices, and instead all edges are represented by edge structures which contain pointers to the adjacent vertices.
How does this representation offer any discernible analytical advantage in any scenario?
This is off the top of my head, so I hope I have the facts correct.
Conceptually, a graph represents how a set of nodes (or vertices) are related (connected) to each other via edges.
However, in an actual physical device (memory), we have a contiguous array of memory cells.
So, to represent the graph, we can choose to use a matrix.
In this case, we use the vertex indices as the row and column, and an entry has the value 1 if the two vertices are adjacent to each other and 0 otherwise.
Alternatively, you can represent a graph by allocating an object for each node/vertex that points to a list of all the nodes adjacent to it.
The matrix representation has the advantage when the graph is dense, meaning that most of the nodes/vertices are connected to each other. In such cases, using a matrix entry saves us from having to allocate an extra pointer (which needs a word of memory) for each connection.
For a sparse graph, the list approach is better because you don't need to store the 0 entries for pairs of vertices that are not connected.
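To make the trade-off concrete, here is a minimal Python sketch of the two layouts described above. It assumes an undirected graph with vertices numbered 0..n-1; the helper names `build_matrix` and `build_lists` are made up for illustration.

```python
def build_matrix(n, edges):
    """Adjacency matrix: n*n entries, 1 where two vertices are adjacent."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1  # undirected: mirror the entry
    return matrix


def build_lists(n, edges):
    """Adjacency lists: one list of neighbours per vertex."""
    lists = [[] for _ in range(n)]
    for u, v in edges:
        lists[u].append(v)
        lists[v].append(u)
    return lists


edges = [(0, 1), (0, 2), (2, 3)]
matrix = build_matrix(4, edges)
lists = build_lists(4, edges)

# The matrix always stores n*n cells, however few edges there are;
# the lists store two entries per undirected edge.
matrix_cells = sum(len(row) for row in matrix)
list_entries = sum(len(adj) for adj in lists)
```

With 4 vertices and 3 edges, the matrix holds 16 cells while the lists hold 6 entries; the balance tips the other way as the graph approaches full density.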
Hope it helps.
For now I have a hard time finding a pro w.r.t typical "graph algorithms". But it sure is possible to represent a graph with objects and pointers and a very natural thing to do if you think of it as a representation of something you just drew on a whiteboard.
Think of a scenario where you want to combine nodes of a graph in a certain order.
Nodes have payloads that contain domain data; the graph structure itself is not a core aspect of your program.
Sure, you can update your lists/matrix for every operation, but given an "objects and pointers" structure, you can do the merging locally. Further, if nodes have payloads, the lists/matrix will hold node IDs that identify the actual node objects. A combination would then mean that you update your graph representation, follow the node identifiers, and do the actual processing. It may feel more intuitive to work on your actual node objects and simply remove pointers when collapsing a neighbor (and delete that node).
Besides, there are more ways to represent a graph:
E.g. just as triples, like Turtle does
Or as an offset representation (offsets per node into an edge array), e.g. this Boost data structure (disclaimer: I have not tested the linked implementation myself)
etc.
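The offset representation mentioned above can be sketched in a few lines of Python. This is only an illustration of the general idea (one array of offsets per node into one flat array of edge targets), assuming a directed graph with vertices numbered 0..n-1; it is not the Boost implementation, and the names `build_offsets` and `neighbours` are hypothetical.

```python
def build_offsets(n, edges):
    """Return (offsets, targets) for a directed graph with vertices 0..n-1.

    The out-neighbours of vertex v are targets[offsets[v]:offsets[v + 1]].
    """
    counts = [0] * n
    for u, _ in edges:
        counts[u] += 1

    # Prefix sums turn per-vertex counts into start positions.
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + counts[v]

    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free position for each vertex
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
    return offsets, targets


def neighbours(offsets, targets, v):
    """Slice out the block of targets that belongs to vertex v."""
    return targets[offsets[v]:offsets[v + 1]]


offsets, targets = build_offsets(4, [(0, 1), (0, 2), (2, 3), (1, 2)])
```

The layout is compact and cache-friendly, but inserting or deleting an edge means rebuilding the arrays, so it suits graphs that are built once and then only read.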
Here is a way I have been using to create a graph with this concept:
#include <vector>

class Node
{
public:
    Node();
    void setLink(Node *n); // takes the address of the node to link to
    virtual ~Node(void);
private:
    std::vector<Node*> m_links; // non-owning pointers to adjacent nodes
};
And the function responsible for creating the link between vertices is:
void Node::setLink(Node *n)
{
    m_links.push_back(n);
}
The objects-and-pointers representation reduces space to exactly V+E, where V is the number of vertices and E the number of edges (down from V+2E for an adjacency list of an undirected graph, where each edge appears in the lists of both endpoints, or even 2V+2E if you store the index->Vertex mapping in a separate hash map). The price is time complexity: looking up a particular edge takes O(E), which is O(V^2) in a dense graph (up from O(V) with an adjacency list). The space saving comes from removing the duplicated edges that appear in the adjacency list.
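As a rough illustration of that trade-off, here is a small Python sketch. It assumes the reading in which vertices hold no links at all and every edge is a separate object referring to its two endpoints; the class names `Vertex`, `Edge` and `Graph` are chosen for the example only.

```python
class Vertex:
    def __init__(self, payload):
        self.payload = payload  # domain data; no links to other vertices


class Edge:
    def __init__(self, a, b):
        self.a = a  # reference to one endpoint
        self.b = b  # reference to the other endpoint


class Graph:
    def __init__(self):
        self.vertices = []  # V objects
        self.edges = []     # E objects, each undirected edge stored once

    def add_vertex(self, payload):
        v = Vertex(payload)
        self.vertices.append(v)
        return v

    def add_edge(self, a, b):
        self.edges.append(Edge(a, b))

    def has_edge(self, a, b):
        # No per-vertex index exists, so this is a linear scan: O(E).
        return any((e.a is a and e.b is b) or (e.a is b and e.b is a)
                   for e in self.edges)


g = Graph()
x, y, z = g.add_vertex("x"), g.add_vertex("y"), g.add_vertex("z")
g.add_edge(x, y)
g.add_edge(y, z)
```

Storage is V vertex objects plus E edge objects, but `has_edge` has to walk the whole edge list, which is the lookup cost described above.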
I have been reading up on directed graphs. I have managed to get an abstract graph data type working in my application but I don't find it particularly intuitive and am considering replacing it with an ordinary multi-dimensional array.
My graph is sparse and acyclic. Each vertex is reachable from one particular 'master' vertex. If it was a tree, this master vertex would be the 'root'. If it was a social network, this master vertex would be 'me'.
Although my graph may have hundreds of thousands of vertices it has a finite depth: the greatest distance between any two nodes is 3 edges.
The underlying data representation is an adjacency list. A small example would look like this:
Head | Tails
--------------
1 | 2, 3, 4
2 | 5
3 | 5
4 | 5
5 | 6
If I was using an ordinary multi-dim array instead of my graph data type, it would look something like this:
$me[1][2][5][6]
$me[1][3][5][6]
$me[1][4][5][6]
Now, the main things that I want to be able to do with this graph are:
Navigate it as a hierarchy. I realise that some child vertices will feature in more than one category (e.g. #5), but that is what I want for this particular use case. I can't see any real difference between an array and a graph for this point.
Lay it out as a list (alphabetical, according to vertex name), with no duplicates. I would probably do a DFS, flagging visited vertices as I go, to avoid exploring them more than once. But as far as I can see this is achievable using either the graph or the array, and at the same cost.
Do an 'all paths' analysis for any given pair of points. Because I want 'all paths' (ie. I'm not simply checking for reachability), it seems to me that I have to traverse the entire graph, and again I can see no advantage in a graph over an array.
I get the feeling that I am missing something, but I can't put my finger on it. Can you??? Any ideas, suggestions, insights or advice gratefully accepted... (By the way, I'm using PHP, and the data source is a relational DB. I don't think this makes any real difference though).
Thanks!
One thing you need to understand is that a directed graph (or digraph) is a concept, whereas an associative array is a data structure.
An instance of the digraph concept can be stored in many different data structures, of which you can find the most common on this Wikipedia page.
I'm not sure what you are doing with your multidimensional array... storing all paths? You will end up with an N³ space complexity, and trouble building it. A tree-based structure would be more efficient at the very least.
Now to the things you want to do with your graph:
Navigate as a hierarchy. The basic digraph concept doesn't let you go up in the hierarchy, but you can easily store the reverse graph as well (especially with matrix-based representations: just use 3 values instead of 2, for forward, backward and nothing).
Lay it out as a list, according to name. You have to store the name somewhere (either in a side map or in the vertex object), but it shouldn't be any harder than sorting anything else according to name.
Do an 'all paths' analysis. You can probably get away with linear complexity (in the number of paths) through dynamic programming (DP) and a shared representation of paths.
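For the 'all paths' point, here is a hedged Python sketch using the adjacency list from the question. `count_paths` uses memoisation (the DP idea) to count paths in a DAG; `all_paths` is a plain recursive enumeration, not the shared path representation hinted at above. Both function names are made up for this example, and the input is assumed to be acyclic.

```python
from functools import lru_cache

# The adjacency list from the question (head -> tails).
GRAPH = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5], 5: [6], 6: []}


def count_paths(graph, source, target):
    """Count distinct paths from source to target in a DAG."""

    @lru_cache(maxsize=None)
    def count_from(node):
        if node == target:
            return 1
        # Each vertex is solved once; later visits reuse the cached result.
        return sum(count_from(nxt) for nxt in graph[node])

    return count_from(source)


def all_paths(graph, source, target):
    """Yield every path from source to target as a list of vertices."""
    if source == target:
        yield [source]
        return
    for nxt in graph[source]:
        for tail in all_paths(graph, nxt, target):
            yield [source] + tail


paths = list(all_paths(GRAPH, 1, 6))
```

On this graph there are three paths from 1 to 6, all passing through vertex 5, which matches the three rows of the multi-dimensional array in the question.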
It looks like your data structure is too complicated. If you represent a directed graph as a multidimensional array, it is almost always of dimension two so that
$array[$x][$y]
is a boolean value that is TRUE if and only if there is an edge from node $x to node $y in the graph. In your example it would be e.g.
$array[1][2] = TRUE
$array[1][5] = FALSE
But for sparse graphs, using this boolean matrix representation is not usually good. Typically you would have a one-dimensional array that maps every node to a set of nodes to which there is an edge, e.g.
$array[1] = { 2, 3, 4 }
where { ... } means some sort of an unordered collection data structure, which can be e.g. a binary search tree or a hash set (hash table).
This data structure enables you to quickly find the nodes to which there is an arc from a given node, which is a key feature for graph algorithms.
Sometimes you want to be able to traverse your graph backwards also; in that case you would have another array that maps nodes to the list of their predecessors.
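A minimal Python sketch of this layout, using hash sets as the unordered collection and a second map for predecessors. The class name `Digraph` and its attribute names are assumptions for illustration; the edges are the ones from the question's example.

```python
from collections import defaultdict


class Digraph:
    def __init__(self):
        self.successors = defaultdict(set)    # node -> nodes it points to
        self.predecessors = defaultdict(set)  # node -> nodes pointing to it

    def add_edge(self, x, y):
        self.successors[x].add(y)
        self.predecessors[y].add(x)

    def has_edge(self, x, y):
        # .get avoids creating an empty entry for unknown nodes.
        return y in self.successors.get(x, ())


g = Digraph()
for head, tails in {1: [2, 3, 4], 2: [5], 3: [5], 4: [5], 5: [6]}.items():
    for tail in tails:
        g.add_edge(head, tail)
```

Forward traversal reads `g.successors[node]` and backward traversal reads `g.predecessors[node]`, each in time proportional to the number of neighbours.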
I'm looking for an efficient way to implement a weighted undirected graph knowing only the number of edges ahead of time.
sample input:
N (number of edges)
A B x (x is the distance from A to B)
.
.
I've thought of using adjacency lists of Node* (I need to know the neighbours) and storing the nodes in a dynamic hash table (I don't know how many nodes I'll get, so I need a dynamic search/insert container).
Are there better ways to do it?
Sorry for my bad English! :D
Given the format you're getting the input in, a very reasonable approach would be to use a hash table of lists, where the keys are the nodes and the values are lists of (node, distance) pairs. Alternatively, if you have a dense graph and want to be able to quickly determine the distance from one node to another, it might be good to have a hash table of hash tables, where the top-level hash table maps each node to a second hash table, which in turn maps each neighbour of that node to the cost of the edge between them. This still lets you iterate across a node's outgoing edges, but gives you faster lookup of distances.
Another idea (depending on the use case) would be to start off by building the first data structure (the hash table of lists), then to post process it by building an adjacency matrix. This would be useful if you didn't need to iterate across a node's outgoing edges and needed fast random access to distances between nodes. It is similar to the hash table of hash tables, but is probably more space efficient.
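As a sketch of the hash-table-of-hash-tables variant, the Python below parses input in the format from the question (the edge count, then one `A B x` line per edge). The function name `read_graph` and the sample data are invented for the example, and weights are assumed to be numeric.

```python
def read_graph(lines):
    """Parse 'N' followed by N lines 'A B x' into a dict of dicts."""
    n = int(lines[0])
    graph = {}
    for line in lines[1:n + 1]:
        a, b, x = line.split()
        weight = float(x)
        # Nodes are created on first sight, so their number need not be known.
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight  # undirected: store both directions
    return graph


sample = ["3", "A B 4", "B C 2.5", "A C 7"]
graph = read_graph(sample)
```

Here `graph["A"]` gives the neighbours of A with their distances, and `graph["A"]["B"]` looks up a single distance in expected constant time. Replacing the inner dict with a list of (node, distance) pairs gives the first, lighter structure.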
Hope this helps!
Here's my situation. I have a graph that has different sets of data being added at different times. For example, set1 might have a few thousand nodes, and then set2 comes in later and we apply business logic to create edges from set1 to set2 (and discard any vertices from set1 that do not have edges to set2). Then at a later point, we get set3, set4, and so on, and the same process applies between each set and its previous set.
My question: what's the best way to organize this? What I did before was name the nodes set1-xx, set2-xx, etc. The problem I faced was that when I tried to run analytics between the current set and the previous set, I had to loop through the entire graph and look for all the nodes that started with 'setx'. It took a long time as the graph grew, so I thought of another solution, which was to create a node called 'set1' and have it connected to all nodes for that particular set. I am testing it, but I was wondering if there was a more efficient way or a built-in way of handling data structures like this. Is there a way to somehow segment data like this?
I think a general solution would be applicable, but if it helps, I'm using neo4j (so any solution specific to that database would be good as well).
You have a very special type of a directed graph, called a layered graph.
The choice of the data structure depends primarily on the expected graph density (how many nodes from a previous set/layer are typically connected to a node in the current set/layer) and on the operations that you need to perform on it most of the time. It is definitely a good idea to have each layer directly represented by a numeric index (that is, the outermost structure will be an array of sets/layers), and presumably you can also use one array of vertices per layer. However, the list of edges per vertex (out only, or in and out sets of edges depending on whether you ever traverse the layers backward) may be any of the following:
Linked list of vertex identifiers; this is good if the graph is very sparse and edges are often added/removed.
Sorted array of vertex identifiers; this is good if the graph is quite sparse and immutable.
Array of booleans, indexed by vertex identifiers, determining whether a given vertex is or is not linked by an edge from the current vertex; this is good if the graph is dense.
The "vertex identifier" can take many forms. For example, it can be an index into the array of vertices on the next layer.
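The layout described above might look roughly like this in Python: an outer array of layers, one array of vertices per layer, and for each vertex a sorted array of indices into the next layer. The class `LayeredGraph` and its method names are hypothetical, only out-edges are stored, and indices are not range-checked.

```python
from bisect import bisect_left


class LayeredGraph:
    def __init__(self):
        # layers[i][v] is the sorted list of vertex indices in layer i + 1
        # that vertex v of layer i links to.
        self.layers = []

    def add_layer(self, vertex_count):
        self.layers.append([[] for _ in range(vertex_count)])
        return len(self.layers) - 1

    def add_edge(self, layer, src, dst):
        """Link vertex src of `layer` to vertex dst of `layer + 1`."""
        out = self.layers[layer][src]
        pos = bisect_left(out, dst)
        if pos == len(out) or out[pos] != dst:
            out.insert(pos, dst)  # keep the list sorted, skip duplicates

    def has_edge(self, layer, src, dst):
        out = self.layers[layer][src]
        pos = bisect_left(out, dst)  # binary search in the sorted list
        return pos < len(out) and out[pos] == dst

    def linked_vertices(self, layer):
        """Indices of vertices in `layer` that have at least one out-edge."""
        return [v for v, out in enumerate(self.layers[layer]) if out]


g = LayeredGraph()
set1 = g.add_layer(3)
set2 = g.add_layer(2)
g.add_edge(set1, 0, 1)
g.add_edge(set1, 2, 0)
g.add_edge(set1, 0, 0)
```

Because each layer is addressed by index, analytics between the current set and the previous one only touch those two layers instead of scanning the whole graph.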
Your second solution is what I would do: create a setX node and connect all nodes belonging to that set to setX. That way your data is partitioned and it is easier to query.
I need to store a directed graph (not necessarily acyclic), so that node deletion is as fast as possible. I wouldn't mind storing additional data in order to know exactly which edges have to go when a node is deleted.
If I store a list of edges (as pairs of node indexes), then when killing some node n I have to search the whole list for edges whose source or target is n. This is too costly for my application. Can this search be avoided by storing some additional data in the nodes?
One idea would be to have each node store its own sources and targets, as two separate lists. When node n is killed, its lists are killed too. But then, how would all the targets/sources linked to node n know to update their own lists (i.e., to eliminate the defunct node from their lists)? This would require some costly searching...
Can it be avoided?
Thx.
Without getting too fancy, you have two choices: adjacency list and adjacency matrix. The former is probably best for what you're doing. To remove a node, simply eliminate that node's list, which holds all of its out-edges. For the in-edges, you might consider keeping a hash table for each list for O(1) lookups.
This is a good overview
http://www.algorithmist.com/index.php/Graph_data_structures
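One way to read this suggestion, sketched in Python: keep a hash set of targets and a hash set of sources per node, so deleting a node only touches its direct neighbours. The class `DeletableDigraph` and its attribute names are assumptions for the example.

```python
class DeletableDigraph:
    def __init__(self):
        self.out_edges = {}  # node -> set of targets
        self.in_edges = {}   # node -> set of sources

    def add_node(self, n):
        self.out_edges.setdefault(n, set())
        self.in_edges.setdefault(n, set())

    def add_edge(self, src, dst):
        self.add_node(src)
        self.add_node(dst)
        self.out_edges[src].add(dst)
        self.in_edges[dst].add(src)

    def remove_node(self, n):
        # Each neighbour is fixed up with one expected O(1) set removal,
        # so the total cost is proportional to the degree of n.
        for dst in self.out_edges.pop(n):
            self.in_edges[dst].discard(n)
        for src in self.in_edges.pop(n):
            self.out_edges[src].discard(n)


g = DeletableDigraph()
for src, dst in [("a", "b"), ("b", "c"), ("c", "a"), ("a", "a")]:
    g.add_edge(src, dst)
g.remove_node("a")
```

No global edge list is searched; the cost is the extra memory for storing every edge twice, once at each end.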
I solved it! This is the solution for undirected graphs, adding direction is easy afterwards.
In each vertex I keep a special adjacency list. It is a list (doubly linked, for easy insertion/deletion) whose elements are "slots":
class Slot {
    Slot prev, next;      // pointers to the other slots in the list
    Slot other_end;       // the other end of the edge: not a vertex, but a Slot!
    Vertex other_vertex;  // the actual vertex at the other end

    void kill() {
        if (next != null) next.kill();  // recursion
        other_end.pop_out();
    }

    void pop_out() {
        if (next != null) next.prev = prev;
        if (prev != null) prev.next = next;
        else other_end.other_vertex.slot_list = next;
        // If this slot is the first in its list, the vertex's slot_list
        // pointer has to be adjusted. other_end.other_vertex is actually the
        // vertex to which this slot belongs, but this slot doesn't know it,
        // so I have to go around like this.
    }
}
So basically each edge is represented by two slots, cross-pointing one to each other. And each vertex has a list of such slots.
When a vertex is killed, it sends recursively a "kill" signal up its slot list. Each slot responds by destroying its other_end (which graciously pops out from the neighbor's list, mending the prev/next pointers behind).
This way a vertex plus all its edges are deleted without any searching. The price I have to pay is memory: instead of 3 pointers (prev, next and vertex for a regular doubly linked adjacency list), I have to keep 4 pointers (prev, next, vertex and other_end).
This is the basic idea. For directed graphs, I only have to distinguish somehow between IN slots and OUT slots. Probably by dividing each vertex's adjacency list in two separate lists: IN_slot_list and OUT_slot_list.
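For anyone who wants to experiment with the idea, here is a runnable Python sketch of the undirected version. It keeps the field names from the Slot class above; the helpers `push_slot`, `connect`, `kill_vertex` and `neighbours` are additions for the example, and the kill step is written as a loop rather than recursion.

```python
class Vertex:
    def __init__(self, name):
        self.name = name
        self.slot_list = None  # first slot of this vertex's list


class Slot:
    def __init__(self, other_vertex):
        self.prev = None
        self.next = None
        self.other_end = None            # twin slot in the neighbour's list
        self.other_vertex = other_vertex

    def pop_out(self):
        """Unlink this slot from the list it belongs to."""
        if self.next is not None:
            self.next.prev = self.prev
        if self.prev is not None:
            self.prev.next = self.next
        else:
            # First in its list: the owning vertex is reached via the twin.
            self.other_end.other_vertex.slot_list = self.next


def push_slot(owner, slot):
    """Insert slot at the front of owner's slot list."""
    slot.next = owner.slot_list
    if owner.slot_list is not None:
        owner.slot_list.prev = slot
    owner.slot_list = slot


def connect(u, v):
    a, b = Slot(v), Slot(u)  # a lives in u's list, b lives in v's list
    a.other_end, b.other_end = b, a
    push_slot(u, a)
    push_slot(v, b)


def kill_vertex(v):
    slot = v.slot_list
    while slot is not None:       # iterative form of the recursive kill()
        slot.other_end.pop_out()  # remove the twin from the neighbour's list
        slot = slot.next
    v.slot_list = None


def neighbours(v):
    names, slot = [], v.slot_list
    while slot is not None:
        names.append(slot.other_vertex.name)
        slot = slot.next
    return names
```

Killing a vertex walks only its own slot list and unlinks one twin per edge, so the work is proportional to the vertex's degree, with no searching in the neighbours' lists.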