I'd like to:
Represent a DAG according to The Ruby Way.
Generate an image from the DAG.
There would be no more than 100 nodes (and often far fewer; say, 10 in the 80th-percentile case). I do not need to store the data permanently, just the image, so no database considerations apply.
Graphviz is the tried-and-true tool for visualizing all sorts of graphs, and it's been around for a while. See: http://www.graphviz.org/
There's a Ruby wrapper for it available; see: https://github.com/glejeune/Ruby-Graphviz
(disclaimer: I've used Graphviz, but not the Ruby wrapper)
If all you need to do is output images, I'd feed textual representations into Graphviz's dot tool.
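For instance, a three-node DAG in plain DOT (the file names below are just placeholders):

digraph dag {
    a -> b;
    a -> c;
    b -> c;
}

Saved as dag.dot, it renders to an image with:

dot -Tpng dag.dot -o dag.png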
I have two graphviz graphs. Let's call them before.dot and after.dot.
I want to know the differences between them. I've opened them with a regular old text/source-code diff, and there is a difference; it's not a subset/superset situation: there are nodes and edges only in before.dot, nodes and edges only in after.dot, and nodes and edges that are in both. How can I process these two files and produce before-only.dot and after-only.dot (even if this takes two separate commands)?
Reading the Graphviz docs pointed me to the gvpr scripting/processing tool, which appears to be the ideal mechanism for this, given that I would like to avoid installing additional tools.
How can I get this task done using gvpr?
I came across this because I am just starting to learn about gvpr.
Years ago I wrote a dot-file differ in awk, for just one particular flavor of dot files:
https://github.com/TomConlin/dipper/blob/master/scripts/deltadot.awk
I would not expect it to work in the general case, but if you are comparing a distilled semantic graph with a previous version of itself, you might get some ideas.
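I haven't written the gvpr version, but the set logic itself is small. If a helper script outside gvpr turns out to be acceptable after all, here is a rough Python sketch using the pydot package (an extra install, so not quite what was asked for; treat it as an illustration of what a gvpr script would need to do):

import pydot

# Parse both files; graph_from_dot_file returns a list of graphs.
before = pydot.graph_from_dot_file("before.dot")[0]
after = pydot.graph_from_dot_file("after.dot")[0]

# Index after.dot's nodes by name and its edges by endpoint pair.
# Caveat: get_nodes() only returns explicitly declared nodes, not
# nodes that appear solely inside edge statements.
after_nodes = {n.get_name() for n in after.get_nodes()}
after_edges = {(e.get_source(), e.get_destination()) for e in after.get_edges()}

# Keep only the elements of before.dot that are absent from after.dot.
before_only = pydot.Dot("before_only", graph_type="digraph")
for n in before.get_nodes():
    if n.get_name() not in after_nodes:
        before_only.add_node(n)
for e in before.get_edges():
    if (e.get_source(), e.get_destination()) not in after_edges:
        before_only.add_edge(e)
before_only.write_raw("before-only.dot")

Swapping before and after produces after-only.dot.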
I am interested in persisting individual directed graphs. This question is not asking for a full-scale graph database solution, but for a document format that I can use to save an individual arbitrary directed graph. I don't know which notation and file format would be the smartest choice.
My primary concerns are:
1. Expressiveness/Flexibility - I want the ability to express graphs of different types. While the standard use case would be a simple directed graph, it should be possible to express trees, cyclic graphs, and multigraphs. As a bare minimum, I would expect support for labeling and weighting of edges and nodes. Notations for describing higraphs and edge composition/hyper-edges would also be highly desirable, although I am aware that such solutions may not exist.
2. Type System-Independence - I am interested in representing the structural qualities of graphs. Some solutions include an extensible type system for typed edges and nodes (e.g. RDF/OWL). I would only be interested in such a representation if there were a clearly defined canonical decomposition of typed elements into primitives (nodes/edges/attributes). What I am trying to avoid here is the possibility of multiple representations of equivalent graphs, where the equivalence is not discernible.
3. Canonical Representation - There should be a mechanism that allows the graph to be represented canonically (in such a way that lexical equivalence of canonical representations could be used to determine graph equivalence).
4. Presentation Independence - I would prefer a notation that is not dependent upon the presentation of the graph. This includes spatial orientation, colors, font, etc. I am only interested in representing the data. One of the features I don't like about the DOT language, DGML, or SVG (at least for this particular purpose) is the focus on visual representation.
5. Standardized/Open/Compatible - The less implementation work I have to do, the better. If the format is standardized and reliable tools already exist for working with it, then it is preferable. Accompanying this requirement is another: the format should be highly compatible. The proprietary nature of Microsoft's DGML is a reason for my aversion, despite the Visual Studio tooling and the fact that I work primarily with .NET (now). The fact that the W3C publishes RDF standards is a motivation for considering a limited subset of RDF as a representational tool. I also appreciate GXL and GraphML, because they have well-documented XML schemas, thereby facilitating the ability to integrate their data with any XML-compatible software package.
6. Simplicity/Readability - I appreciate human-readable syntax and ease of interpretation. I also appreciate representations that simplify parsing. For this reason, I like GML, but I am concerned it is not mainstream enough to be a realistic choice. I would also consider JSON or YAML for readability, if they were not so limited in their respective abilities to represent complex (non-DAG) structures.
7. Efficiency/Concise Representation - It's worth considering that whatever format I end up choosing will inevitably have to be persisted and transferred over some network. Therefore, file size is a relevant consideration.
Overview
I recognize that I will most likely be unable to find a solution that satisfies every criterion on my wishlist. I am simply asking for the file format that comes closest to what I want and that doesn't limit extensibility for unsupported use cases.
ObWindyPreamble: in the RDF world, there are a gazillion different surface syntax formats to choose from. RDF itself is an abstract metamodel for data, not directly a "graph syntax". You can of course directly represent a graph in RDF (since RDF models are graphs), but given that you want to represent different kinds of graphs, you may end up having to abstract away and actually create an RDF vocabulary for representing different types of graphs.
All in all, I'm not convinced that RDF is the best way to go for you, but if you'd choose one, I'd say that RDF's Turtle syntax is something worth looking into. It certainly ticks the readability and simplicity boxes, as well as being a standard (well, almost... W3C is working on standardizing it) and having wide (open-source) tool support.
RDF models roughly follow set semantics, which means that a canonical syntax representation cannot really be enforced: two files can list the same information in a different order without it affecting the actual model, and can even contain duplicate information. However, if you enforce a simple sorting algorithm when producing files (something most RDF parsers/writers support), you should be able to get away with doing line-based comparisons and determining graph equivalence based on the surface syntax.
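As a sketch of that sorting idea in Python (assuming both models are already serialized in a line-based syntax such as N-Triples, and ignoring blank-node renaming, which this naive comparison cannot handle):

# Canonicalize two N-Triples files by de-duplicating and sorting
# their lines, then compare the results for model equivalence.
def canonical_lines(path):
    with open(path) as f:
        return sorted({line.strip() for line in f if line.strip()})

print(canonical_lines("a.nt") == canonical_lines("b.nt"))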
Just as a simple example, let's assume we have a very simple, directed, labeled graph:
A ---r1---> B ---r2---> C
You could represent this directly in RDF, as follows (using Turtle syntax):
@prefix : <http://example.org/> .
:A :r1 :B .
:B :r2 :C .
In a more abstract modeling, you could do something like this:
@prefix g: <http://example.org/graph-model/> .
@prefix : <http://example.org/> .
:A a g:Vertex .
:B a g:Vertex .
:C a g:Vertex .
:r1 a g:DirectedEdge ;
    g:from :A ;
    g:to :B .
:r2 a g:DirectedEdge ;
    g:from :B ;
    g:to :C .
The above is just a simplistic example of course, but hopefully it illustrates that this potentially meets quite a few of the things on your wish list.
By the way, if you want even simpler, N-Triples is also an RDF syntax, which is line-based and therefore easy to process in a streaming fashion. It's slightly more verbose than Turtle but it may make file comparison easier.
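For the direct representation above, the N-Triples equivalent would be:

<http://example.org/A> <http://example.org/r1> <http://example.org/B> .
<http://example.org/B> <http://example.org/r2> <http://example.org/C> .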
My thoughts:
What I'm missing is your particular practical purpose/domain.
You mention the generic JSON format next to specific formats (e.g. GraphML, which is an application of XML). So I'm left wondering whether or not you would consider making your own format.
Wouldn't having a 'canonical representation that can be used to determine equivalence' solve the graph isomorphism problem?
GraphML seems to cover a lot of your theoretical requirements, so I'd suggest you create a JSON version of it. This would then also cover requirement 6 (simplicity/readability).
Then, you could create a converter between the JSON format and GraphML (and possibly other formats).
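Purely as a hypothetical illustration (the field names below are invented for the example, not part of any standard), such a JSON mirror of GraphML's node/edge structure might look like:

{
  "directed": true,
  "nodes": [
    { "id": "A", "label": "start", "weight": 1.0 },
    { "id": "B" }
  ],
  "edges": [
    { "source": "A", "target": "B", "label": "r1", "weight": 0.5 }
  ]
}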
For your requirement 7 (conciseness), it again all depends on the practical graph sizes. I mean, nowadays sending up to a few MB to a friggin mobile device is not considered much. A graph of a few MB, in (about) any format you mention, is already a relatively large beast with tens of thousands of nodes and edges.
What about the Trivial Graph Format? For the same A/B/C example shown above, the entire file would be:
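1 A
2 B
3 C
#
1 2 r1
2 3 r2

Node lines give an id and a label, the # line separates nodes from edges, and each edge line gives the two endpoint ids plus an optional label. It is about as concise and readable as formats get, though it has no standard notion of node/edge attributes or hypergraphs.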
In a system, I have a list of nodes which are connected as in a normal graph. We know the whole system and all of its connections, and we also have a start point. All my edges have a direction.
Now I want to draw all of these nodes and edges automatically. The problem is not the actual drawing, but calculating the (x, y) coordinates. Basically, I would like to lay out this whole graph so it looks good.
My data structure would be something like:
class node:
    string text
    List<edge> connections
There must be some well-known algorithms for this problem, but I haven't been able to find any; I might be using the wrong keywords.
My thoughts:
One way would be to position the start node at (0,0) and define some constant distance. Then, for each level of neighbors, add distance to the y position, and for the n-th node within a level, set x = distance * n.
But this would really cause a lot of problems, so that's definitely not the way to go.
By far the most common approach for this is to use a force-directed layout instead of a deterministic one. The gist is that every node repels every other node (anti-gravity) while connected pairs of nodes attract each other. After several iterations of a physics simulation, you get a reasonable layout.
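For example, a minimal sketch with NetworkX in Python, whose spring_layout is a force-directed (Fruchterman-Reingold) implementation; the example edges here are made up:

import networkx as nx

# A small directed graph, built from its edge list.
G = nx.DiGraph()
G.add_edges_from([("start", "a"), ("start", "b"), ("a", "c"), ("b", "c")])

# Run the physics simulation; the result maps each node to (x, y)
# coordinates roughly in [-1, 1], ready to scale onto your canvas.
pos = nx.spring_layout(G, iterations=50, seed=42)
for node, (x, y) in pos.items():
    print(node, x, y)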
There are many layout algorithms you can use, with vastly different results. The GraphViz fdp (Fruchterman & Reingold '91) and neato (Kamada & Kawai '89) algorithms work, but are rather old and there are much better alternatives. The Fruchterman & Reingold '91 algorithm is also available in Python in NetworkX.
Prefuse provides a ForceDirectedLayout Java class that is pretty fast and good. Hachul & Jünger '05 detail the FM^3 algorithm, which appears to do quite well in practice (Hachul & Jünger '06) and is available in C++ in Tulip.
There are tons of other open-source tools to visualize graphs, like NodeXL (C#), a great introductory tool that integrates network analysis into Excel 2007/2010 (disclaimer: I'm an advisor for it). Other awesome tools include Gephi (Java) and Cytoscape (Java), while Pajek, UCINet, yEd, and Tom Sawyer are some proprietary alternatives.
In general this is a tricky problem, especially if you want to start dealing with edge routing and making things look pretty. You might look at http://www.graphviz.org/ and use either their command-line tools or the Graphviz library to do your layout and get the x,y coordinates within your application.
Given a large scale-free graph (a social network graph), what's the best way to sample it such that the sample retains an acceptable abstraction of the properties of the original?
I have a large graph (Munmun's twitter dataset, if you know it). But I need a connected sample of that graph with a reasonably large diameter (tl;dr... reasons why on request... a diameter of 10 would be good).
The problem is that any kind of breadth-first search is likely to come across some massively connected nodes. When I start such a search, getting the friends of every node I come across, I inevitably hit some massively connected nodes and have to fetch all their friends. This is a problem because I end up with a large number of nodes that are close to each other in the graph. To make programmatic analysis feasible, I have to limit the number of nodes (and edges). The whole point of this exercise is to find shortest paths between nodes, so I'm generally interested in ALL of a node's neighbours. And that's the problem.
One hack around this is to limit the maximum number of accepted neighbours per node. For instance, if I come across #barackobama in my breadth-first search, I ensure that I only accept some small proportion of his friends and ignore the rest. But would this hacked graph be worth a damn, or am I losing too much information in terms of finding shortest paths?
Hope that makes sense...
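For concreteness, here is roughly what that capped search looks like (a simplified Python sketch; the neighbors function is a stand-in for the actual friend lookup, and the cap values are arbitrary):

import random
from collections import deque

def capped_bfs_sample(neighbors, start, max_nodes=10000, cap=50):
    # neighbors: callable returning the full neighbour list of a node.
    seen = {start}
    queue = deque([start])
    edges = []
    while queue and len(seen) < max_nodes:
        u = queue.popleft()
        nbrs = neighbors(u)
        if len(nbrs) > cap:                # hub node: keep a random subset
            nbrs = random.sample(nbrs, cap)
        for v in nbrs:
            edges.append((u, v))
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen, edges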
Several sampling methods exist; how to choose one depends (amongst other things) on the properties you want to preserve. I found the literature review (section 3) in the thesis Sampling and Inference in Complex Networks [Maiya '11] very informative in that regard.
But you seem to have found a way of sampling your network, and now you want to find out whether the sample is representative of the whole graph in terms of shortest paths. You could have a look at this paper: Complex Network Measurements: Estimating the Relevance of Observed Properties [Latapy & Magnien '08]. They describe a method to assess the representativeness of a sample with regard to various classic topological properties. To summarize their approach: they initially have access to the whole studied network, simulate some sampling process on these data with increasing sample size, monitor how the properties evolve with sample size, and decide on an appropriate size once the properties of interest are stable enough. Their tool is freely available online.
Edit: the only ready-to-use tool I could find online is Albatross. The associated article Albatross Sampling: Robust and Effective Hybrid Vertex Sampling for Social Graphs [Jin et al. '11] also contains a nice review of existing sampling methods, some of which are implemented in the source code they provide.
Edit 2: I needed to use Albatross on a Linux system, so I did a Java port. It's very raw, but it seems to work fine. It's available on GitHub: https://github.com/vlabatut/Albatross
I am not sure if I understand your question correctly. I think the main question you have is how to compute the shortest path between two nodes in a giant directed graph; creating a subsample of the graph seems to be your attempt at an efficient solution. (But I may have misunderstood you completely.)
Perhaps this SO-Question has some pointers for you: Efficiently finding the shortest path in large graphs
The graphs in that question seem to be significantly smaller, though.
You might want to check out Gscaler (https://github.com/jayCool/Gscaler), a recent tool that produces synthetic scaled graphs. The repository contains the jar file and the related paper for reference.
I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.
A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).
I have found information on a modification of the DAWG for string data tracking here: http://www.pathcom.com/~vadco/adtdawg.html
It looks extremely, extremely complex, and I am not sure I am capable of writing it.
I have also found a few research papers describing incremental building algorithms, though research papers in general have not been very helpful to me.
I don't think I am advanced enough to combine both of these algorithms myself. Is there existing documentation of an algorithm that combines these features, or an alternative algorithm with good memory use and speed?
I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.
The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.
As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.
My advice is to use a standard trie, with an array index stored in each node (a minimal sketch follows at the end of this answer). Ya, it is going to seem infantile, but each END_OF_WORD node represents only one word. The ADTDAWG is a solution to the problem that, in a traditional DAWG, each END_OF_WORD node represents many, many words.
Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.
I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.
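To make the trie suggestion above concrete, here is a minimal sketch in Python (the payload contents and words are made up for the example):

class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.data = None     # payload; None means "no word ends here"

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, data):
        # Incremental by nature: words can be added at any time.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.data = data

    def lookup(self, word):
        # Returns the payload stored with the word, or None if absent.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.data

t = Trie()
t.insert("cat", {"freq": 12})
print(t.lookup("cat"))   # -> {'freq': 12}
print(t.lookup("ca"))    # -> None (a prefix, not a stored word)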
Java
For graph problems which require persistence, I'd take a look at the Neo4j graph DB project. Neo4j is designed to store large graphs and allow incremental building and modification of the data, which seems to meet the criteria you describe.
They have some good examples to get you going quickly, and there's usually sample code available for most problems.
They have a DAG example with a link at the bottom to the full source code.
C++
If you're using C++, a common solution for graph building/analysis is the Boost Graph Library. To persist your graph, you could maintain a file-based version in GraphML (for example) and read and write that file as your graph changes.
You may also want to look at a trie structure for this (potentially building a radix-tree). It seems like a decent 'simple' alternative structure.
I'm suggesting this for a few reasons:
I really don't have a full understanding of your desired result.
Definitely incremental to build.
Leaf nodes can contain any data you wish.
Subjectively, a simple algorithm.