Is there a way to group edges together into a single bus in DOT/Graphviz, while still fraying the edges near nodes?
For example, changing this:
graph my_graph {
rankdir=LR
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- C;
}
into something like this:
One way to do it is with dummy [shape=point] nodes, like so:
graph my_graph {
rankdir=LR;
{rank = same; B; C;}
aa [ shape=point ];
bb [ shape=point ];
A -- aa;
A -- aa;
A -- aa;
A -- aa;
A -- aa;
A -- aa;
A -- aa;
A -- aa;
A -- aa;
aa -- bb [ penwidth=4.0];
bb -- B;
bb -- B;
bb -- B;
bb -- B;
bb -- B;
bb -- B;
bb -- B;
bb -- B;
bb -- B;
A -- C;
}
This has some downsides though:
It doubles the number of edges in your source file, which makes the file hard to manage.
If you have head and tail labels, then you need to place them accordingly.
It mucks up your node ranks, so you have to manually adjust all the other nodes.
The frayed ends are still rounded and take up a lot of space. This might be fixable by setting the splines attribute, but I haven't been able to get it to work (it seems that multiple edges between two nodes are always drawn as splines, no matter what the value of splines is).
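The first downside (duplicated edges in the source file) can be softened by generating the DOT text instead of writing it by hand. The following is only a minimal sketch in Python, not part of Graphviz itself: the function name `bus_dot`, its parameters, and the junction node names (`A_B_in`, `A_B_out`) are assumptions made up for this illustration. It reproduces the dummy-point-node layout shown above.

```python
def bus_dot(src, dst, n_wires, extra_edges=(), same_rank=(), bus_width=4.0):
    """Build DOT source that routes n_wires parallel edges from src to dst
    through two point-shaped junction nodes joined by one thick "bus" edge.

    All names here are hypothetical helpers for illustration only.
    """
    # Junction node names are derived from the endpoints (an assumption).
    tap_in = f"{src}_{dst}_in"
    tap_out = f"{src}_{dst}_out"

    lines = ["graph my_graph {", "  rankdir=LR;"]
    if same_rank:
        # Optional rank constraint, as in the hand-written example.
        lines.append("  {rank = same; " + " ".join(f"{n};" for n in same_rank) + "}")
    lines.append(f"  {tap_in} [shape=point];")
    lines.append(f"  {tap_out} [shape=point];")

    # Frayed end near the source node.
    lines += [f"  {src} -- {tap_in};" for _ in range(n_wires)]
    # The bus itself: a single thick edge.
    lines.append(f"  {tap_in} -- {tap_out} [penwidth={bus_width}];")
    # Frayed end near the destination node.
    lines += [f"  {tap_out} -- {dst};" for _ in range(n_wires)]

    # Ordinary edges that are not part of the bus.
    lines += [f"  {a} -- {b};" for a, b in extra_edges]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(bus_dot("A", "B", 9, extra_edges=[("A", "C")], same_rank=("B", "C")))
```

The printed text can be saved to a file and passed to dot as usual. This does nothing about the other downsides (label placement, ranks, rounded ends); it only keeps the edge count in one place.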
Another option:
There's also the concentrate attribute, but this will only collect lines going to the same point. It will not 1) fray ends or 2) increase line thickness to represent the bus:
graph my_graph {
rankdir=LR
concentrate=true;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- B;
A -- C;
}
I have several related sub-graphs that I want to draw together in GraphViz. When I draw simple nodes it looks quite pretty:
Source:
digraph {
rankdir=LR;
A1 -> A21;
A1 -> A22;
A1 -> A23;
A1 -> A24;
B1 -> B21;
B1 -> B22;
B1 -> B23;
B1 -> B24;
A21 -> A31;
A22 -> A31;
A23 -> A31;
A23 -> A32;
B21 -> B31;
B21 -> B32;
B22 -> B32;
B21 -> B33;
B23 -> B33;
}
Because the nodes at the same level across the sub-graphs are related, I want to group them and give each group a label.
I tried to do it using clusters, but it "squeezes" the nodes:
Source:
digraph {
rankdir=LR;
subgraph cluster_level1 {
label = "Level 1";
style=filled;
color=lightgrey;
A1;
B1;
}
subgraph cluster_level2 {
label = "Level 2";
style=filled;
color=lightgrey;
A21;
A22;
A23;
A24;
B21;
B22;
B23;
B24;
}
subgraph cluster_level3 {
label = "Level 3";
style=filled;
color=lightgrey;
A31;
A32;
B31;
B32;
B33;
}
A1 -> A21;
A1 -> A22;
A1 -> A23;
A1 -> A24;
B1 -> B21;
B1 -> B22;
B1 -> B23;
B1 -> B24;
A21 -> A31;
A22 -> A31;
A23 -> A31;
A23 -> A32;
B21 -> B31;
B21 -> B32;
B22 -> B32;
B21 -> B33;
B23 -> B33;
}
With just two sub-graphs, it's bad, but still not horrible. However, if I add more sub-graphs, it becomes uglier and uglier.
Is there a way to group nodes with some shadowing and labels, while keeping the original nodes layout using GraphViz?
This is probably not a great answer, since it takes a lot of trial and error, but you can get what you want (I think) by padding the clusters with invisible nodes:
digraph {
rankdir=LR;
subgraph cluster_level1 {
label = "Level 1";
style=filled;
color=lightgrey;
A01[ style = invis ];
A1;
A02[ style = invis ];
A03[ style = invis ];
A06[ style = invis ];
A05[ style = invis ];
B1;
A04[ style = invis ];
}
subgraph cluster_level2 {
label = "Level 2";
style=filled;
color=lightgrey;
A21;
A22;
A23;
A24;
B21;
B22;
B23;
B24;
}
subgraph cluster_level3 {
label = "Level 3";
style=filled;
color=lightgrey;
A07[ style = invis ];
A31;
A32;
A08[ style = invis ];
B31;
B32;
B33;
A01[ style = invis ];
A09[ style = invis ];
}
A1 -> A21;
A1 -> A22;
A1 -> A23;
A1 -> A24;
B1 -> B21;
B1 -> B22;
B1 -> B23;
B1 -> B24;
A21 -> A31;
A22 -> A31;
A23 -> A31;
A23 -> A32;
B21 -> B31;
B21 -> B32;
B22 -> B32;
B21 -> B33;
B23 -> B33;
}
If I run graphviz on this digraph:
digraph G {
subgraph cluster_0 {
style=filled;
color=lightgrey;
node [style=filled,color=white];
a0; a1; a2; a3;
label = "sources";
}
subgraph cluster_1 {
style=filled;
color=lightgrey;
node [style=filled,color=white];
b0; b1; b2; b3;
label = "intermediaries";
}
a0 -> b0; a1 -> b0;
a0 -> b1; a1 -> b1;
a2 -> b2; b0 -> b2;
b1 -> b2; a3 -> b3;
b0 -> b3; b1 -> b3;
}
I get a layout with many edges intersecting the "intermediaries" label. How do I get Graphviz to make edges avoid labels?
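As a workaround, replace the cluster label with an ordinary node that carries the label text, and tie it to the cluster's nodes with invisible edges: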
digraph G {
subgraph cluster_0 {
style=filled;
color=lightgrey;
node [style=filled,color=white];
a0; a1; a2; a3;
label = "sources";
}
subgraph cluster_1 {
style=filled;
color=lightgrey;
node [style=filled,color=white];
nodelabel [label="intermediaries"; style=filled; color=lightgrey]
nodelabel -> b1 [style=invis];
nodelabel -> b0 [style=invis];
b0; b1; b2; b3;
}
a0 -> b0; a1 -> b0;
a0 -> b1; a1 -> b1;
a2 -> b2; b0 -> b2;
b1 -> b2; a3 -> b3;
b0 -> b3; b1 -> b3;
}
I have code to display two subgraphs:
graph {
rankdir=LR;
subgraph cluster01 {
label="t=0"
a0 [label="A"];
a1 [label="B"];
a2 [label="C"];
a5 [label="E"];
a0 -- a1;
a1 -- a2 ;
a2 -- a0;
};
subgraph cluster02
{
label="t=10"
b0 [label="A"];
b5 [label="E"];
b1 [label="B"];
b2 [label="C"];
b0 -- b1;
b2 -- b5;
};
a0--b0 [style=dotted];
a1--b1 [style=dotted];
a2--b2 [style=dotted];
a5--b5 [style=dotted];
}
This code displays two subgraphs like this:
But I want to have it like this:
I hope someone will help me fix the "rankdir" to get it done.
The following was achieved by using invisible edges and constraint=false on some edges:
graph {
rankdir=LR;
subgraph cluster01 {
label="t=0";
a0 [label="A"];
a1 [label="B"];
a2 [label="C"];
a5 [label="E"];
a0 -- a1;
a1 -- a2;
a2 -- a5 [style=invis];
a2 -- a0 [constraint=false];
};
subgraph cluster02
{
label="t=10"
b0 [label="A"];
b5 [label="E"];
b1 [label="B"];
b2 [label="C"];
b0 -- b1;
b1 -- b2 [style=invis];
b2 -- b5;
};
edge[constraint=false];
a0--b0 [style=dotted];
a1--b1 [style=dotted];
a2--b2 [style=dotted];
a5--b5 [style=dotted];
}
I have a data.frame named "d" with ~1,300,000 rows and 4 columns, and another data.frame named "gc" with ~12,000 rows and 2 columns (but see the smaller example below).
d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("a","b","c"), chr=c("c1","c2","c3") )
Here is what "d" looks like:
   gene         val ind exp
1     a  1.38711902  i1  e1
2     b -0.25578496  i1  e1
3     c  0.49331256  i1  e1
4     a -1.38015272  i1  e2
5     b  1.46779219  i1  e2
6     c -0.84946320  i1  e2
7     a  0.01188061  i2  e1
8     b -0.13225808  i2  e1
9     c  0.16508404  i2  e1
10    a  0.70949804  i2  e2
11    b -0.64950167  i2  e2
12    c  0.12472479  i2  e2
And here is "gc":
  gene chr
1    a  c1
2    b  c2
3    c  c3
I want to add a 5th column to "d" by incorporating data from "gc" that match with the 1st column of "d". For the moment I am using sapply.
d$chr <- sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
But on the real data it takes a very long time (I have been running the command under system.time() for more than 30 minutes and it still has not finished).
Do you have any idea of how I could rewrite this in a clever way? Or should I consider using plyr, maybe with the "parallel" option (I have four cores on my computer)? In such a case, what would be the best syntax?
Thanks in advance.
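I think you can just use the factor as an index (this relies on gene being stored as a factor, which was the data.frame() default before R 4.0.0):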
gc[ d[,1], 2]
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
does the same as:
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
but it is much faster:
> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
user system elapsed
5.03 0.00 5.02
>
> system.time(replicate(1000,gc[ d[,1], 2]))
user system elapsed
0.12 0.00 0.13
Edit:
To expand a bit on my comment: for this to work, gc must contain exactly one row for each level of gene, with the rows in the same order as the levels. Otherwise the positional lookup silently returns the wrong rows:
d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("c","a","b"), chr=c("c1","c2","c3") )
gc[ d[,1], 2]
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
But it is not hard to fix that:
gc$gene <- factor(gc$gene, levels = levels(d$gene)) # align the level order with d$gene; assigning to levels() directly would only relabel the values
gc <- gc[order(gc$gene),]
gc[ d[,1], 2]
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
An alternative solution that does not beat Sasha's approach timing-wise, but is more generalizable and readable, is to simply merge the two data frames:
d <- merge(d, gc)
I have a slower system, so here are my timings:
> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
user system elapsed
11.22 0.12 11.86
> system.time(replicate(1000,gc[ d[,1], 2]))
user system elapsed
0.34 0.00 0.35
> system.time(replicate(1000, merge(d, gc, by="gene")))
user system elapsed
3.35 0.02 3.40
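The benefit is that you can join on multiple keys, control how non-matching rows are handled (all.x, all.y), and so on. Note that merge() sorts the result by the key by default, so the row order of d may change.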
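To make the difference between a positional lookup and a key-based one concrete outside of R, here is a minimal sketch in plain Python (chosen only for illustration; the function name `add_chr` and the `missing` parameter are made up for this example). It uses the shuffled gene/chr table from the edit above and looks each gene up by name, so the order of the lookup table does not matter and non-matching genes get an explicit placeholder.

```python
def add_chr(rows, gene_to_chr, missing=None):
    """Return copies of rows with a 'chr' field looked up by gene name.

    rows        -- list of dicts, each with a 'gene' key
    gene_to_chr -- dict mapping gene name to chromosome
    missing     -- value used when a gene has no entry in gene_to_chr
    """
    # dict.get is a hash lookup by key, so the insertion order of
    # gene_to_chr is irrelevant to the result.
    return [{**row, "chr": gene_to_chr.get(row["gene"], missing)} for row in rows]


# Lookup table deliberately not in alphabetical order, as in the edit above.
gc = {"c": "c1", "a": "c2", "b": "c3"}

# A small stand-in for "d": genes a, b, c repeated for two experiments.
d = [{"gene": g, "exp": e} for e in ("e1", "e2") for g in ("a", "b", "c")]

annotated = add_chr(d, gc)

if __name__ == "__main__":
    for row in annotated:
        print(row)
```

The lookup table is built once and each row costs one hash lookup, which is the same reason a vectorised index or merge() beats scanning gc once per row of d.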