How do I make the arrow reach two tags in Mermaid?

I have created this flowchart using Mermaid:
```mermaid
flowchart TD
id1(( ))
id1-->Tag1
Tag1-->Tag2
subgraph Tag2
id2(( ))
id2-->Tag3
Tag3-->Tag4
subgraph Tag4
id3(( ))
id3-->Tag5
Tag5-->Tag6
Tag6-->Tag7
end
Tag7-->Tag8
end
Tag8-->Tag9
```
My problem is that the arrow from Tag1 to Tag2 isn't long enough. How do I fix it? I tried the code below, but the connection should run from Tag1 to the Tag2 subgraph itself, not to the black circle at the beginning of Tag2.
```mermaid
flowchart TD
id1(( ))
id1-->Tag1
Tag1-->id2(( )) %% Tag1-->Tag2 in original
subgraph Tag2
id2(( ))
id2-->Tag3
Tag3-->Tag4
subgraph Tag4
id3(( ))
id3-->Tag5
Tag5-->Tag6
Tag6-->Tag7
end
Tag7-->Tag8
end
Tag8-->Tag9
```

I tried your first code on StackEdit:
```mermaid
flowchart TD
id1(( ))
id1-->Tag1
Tag1-->Tag2
subgraph Tag2
id2(( ))
id2-->Tag3
Tag3-->Tag4
subgraph Tag4
id3(( ))
id3-->Tag5
Tag5-->Tag6
Tag6-->Tag7
end
Tag7-->Tag8
end
Tag8-->Tag9
```
And its result is as follows (screenshot omitted).
I think this is what you want; the problem may lie with the Mermaid viewer you are using.
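If your renderer still draws the edge short, one pattern worth trying (a sketch, not guaranteed for every viewer) is to declare the subgraphs first and list the edges afterwards, so that Tag1-->Tag2 unambiguously targets the subgraph rather than creating a plain node:
```mermaid
flowchart TD
subgraph Tag2
id2(( ))
id2-->Tag3
Tag3-->Tag4
subgraph Tag4
id3(( ))
id3-->Tag5
Tag5-->Tag6
Tag6-->Tag7
end
Tag7-->Tag8
end
id1(( ))
id1-->Tag1
Tag1-->Tag2
Tag8-->Tag9
```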


Set pre-defined node styles?

I've been googling around for the last 15 minutes trying to find an answer to this, but I can't seem to figure it out.
I was tasked with building some small flowcharts for some applications I've developed at work. They don't need anything fancy, because they are going to convert it into their preferred format in Visio. They even said we could do it with pen and paper. So I figured I would play around with Graphviz/dot.
They have 6 pre-defined shapes/colors that they like to use, so I figured I would use them. I've already built them all in dot...but if I plan to re-use them many times, I'd like to find a way to save them as a sort of template.
Is that possible?
For example, these are the predefined shapes:
digraph G {
node [color="#4271C6"]
process [
shape=Mrecord,
style=filled, fillcolor="#E1F4FF",
label="{1. Process\l | Description}"];
subprocess [
shape=record,
style=filled, color="#FFFFFF", fillcolor="#A5A5A5",
label="| Sub-Process |"];
database [
shape=cylinder, color="#18589A",
label="Database"];
inputoutput [
shape=polygon,
style=filled, fontcolor=white,
fixedsize=true, skew=0.3, margin=0,
width=2, label="Input / Output"];
file [
shape=folder,
label="File"];
external [
shape=box3d,
label="External entity"];
}
Unfortunately, there is no way to define macros or objects and reuse them, especially across multiple graphs. However, there are ways using other tools: some folks use m4 (the macro language) or cpp (the C pre-processor). Both work, but there are potential OS issues. Python, awk, etc. would also work.
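For illustration, here is a minimal Python pre-processor sketch along those lines (the style table and node names are hypothetical, not from the original answer): it expands named styles into DOT attribute lists and prints a complete graph that can be piped into dot.
```python
#!/usr/bin/env python3
"""Sketch: expand named style 'macros' into DOT node attributes."""

# Hypothetical style table; the attribute strings mirror the question's shapes.
STYLES = {
    "process": 'shape=Mrecord, style=filled, fillcolor="#E1F4FF"',
    "subprocess": 'shape=record, style=filled, color="#FFFFFF", fillcolor="#A5A5A5"',
    "database": 'shape=cylinder, color="#18589A"',
}

def node(name, style, label):
    """Return one DOT node statement with the named style applied."""
    return f'{name} [{STYLES[style]}, label="{label}"];'

lines = ["digraph G {", 'node [color="#4271C6"];']
lines.append(node("createfile", "process", "{1. Process\\l | Create file}"))
lines.append(node("database", "database", "Database"))
lines.append("createfile -> database;")
lines.append("}")
print("\n".join(lines))  # e.g. python3 pre.py | dot -Tpng > out.png
```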
Here is a gvpr program (gvpr is part of the Graphviz package) that also does what you want (I think):
digraph pre{
a [_type=process label="{1. Process\l | Something}"]
b [_type=process label="{2. Process\l | Something else}"]
c [_type=subprocess label="do it"]
d [_type=database label="lots of data"]
e [_type=database label="a bit of data"]
f [_type=inputoutput label="inOut"]
g [_type=file label="nail file"]
h [_type=external label="outside"]
a->b->c->d->e->f->g->h
}
The gvpr program:
BEG_G{
$G.newrank="true";
}
N{
$.color="#4271C6"; // default
}
N[_type=="process"]{
$.shape="Mrecord";
$.style="filled";
$.fillcolor="#E1F4FF";
// maybe redo $.label
}
N[_type=="subprocess"]{
$.shape="record";
$.style="filled";
$.color="#FFFFFF";
$.fillcolor="#A5A5A5";
$.label=sprintf("|%s|", $.label); // embed in pipes
}
N[_type=="database"]{
$.shape="cylinder";
$.color="#18589A";
}
N[_type=="inputoutput"]{
$.shape="polygon";
$.style="filled";
$.fontcolor="white";
$.fixedsize="true";
$.skew="0.3";
$.margin="0";
$.width="2";
}
N[_type=="file"]{
$.shape="folder";
}
N[_type=="external"]{
$.shape="box3d";
}
This produces the styled graph (image omitted).
There may currently be problems with gvpr on Windows, but I know the development team is working on it.
Here is the command line:
gvpr -c -f predefined.gvpr predefined2.gv | dot -Tpng > predefined2.png
Okay, so I figured it out. I didn't realize you could do this, but apparently you can break up a node definition into multiple parts, so this is what I came up with, which solves my problem.
I have a "Styles" section that goes at the top, where I can define each node style. I use comments as a way of naming them. And I don't need to copy-paste, because I can just define multiple nodes as a comma-separated list.
I also found that you can put them into subgraphs as well, like subgraph style_file {...}. But it seemed simpler to just use a comment as a way to name the style.
digraph G {
newrank=true;
///////////////////////////////////////////////////////////
// Styles
///////////////////////////////////////////////////////////
node [color="#4271C6"];
edge [color="#4271C6"];
//process
createfile, uploadfile
[shape=Mrecord, style=filled, fillcolor="#E1F4FF"];
//subprocess
exportfile, wait
[shape=record, style=filled, color="#FFFFFF", fillcolor="#A5A5A5"];
//external
ftp
[shape=box3d];
//datastore
database
[shape=cylinder, color="#18589A"];
//io
exportproc
[shape=polygon, style=filled, fontcolor=white, margin=0, width=3.1, fixedsize=true, skew=0.3];
//file
workfile
[shape=folder];
///////////////////////////////////////////////////////////
// Clusters
///////////////////////////////////////////////////////////
subgraph cluster_0 {
createfile [label="{1. Process\l | Create file}"];
exportfile [label="|Export Data\nfrom DB|"];
database [label="Database"];
exportproc [label="Export Data"];
workfile [label="Generated file\n(Archived on server)"];
}
subgraph cluster_1 {
uploadfile [label="{2. Process\l | Upload file}"];
ftp [label="FTP Server"];
wait [label="|Wait for\nresponse file|"];
}
///////////////////////////////////////////////////////////
// Ranking
///////////////////////////////////////////////////////////
{
rank=same;
createfile;
uploadfile;
}
///////////////////////////////////////////////////////////
// Relationships
///////////////////////////////////////////////////////////
// cluster_0
createfile -> exportfile;
exportfile -> database;
database -> exportproc;
exportproc -> workfile [style=dashed];
workfile -> uploadfile;
// cluster_1
uploadfile -> ftp [style=dashed];
ftp -> wait;
}
Which produces the diagram shown (image omitted).
No affiliation, but the Excel to Graphviz application can also create re-usable styles (screenshot omitted).

HTML horizontal line error in Graphviz

I need to create a subgraph cluster whose label is separated from the nodes by a horizontal line.
subgraph cluster_0{
label=< <B>process #1</B> <HR/> >
node [shape=none]
t1 [label="label1"]
t2 [label="label2"]
t3 [label="label 3"]
node [shape=box group=a style=filled fillcolor="red;.5:white" height=.2 label = "" ]
A [ fillcolor="red;0.3:white" ]
B [fillcolor="red;.9:white"]
C
node [shape=none fillcolor=white]
t11 [label="label1"]
t21 [label="label2"]
t31 [label="label 3"]
edge[style=invis];
A->B->C
t1->t2->t3
t11->t21->t31
}
Then I get a syntax error. The error stack:
pydot.InvocationException: Program terminated with status: 1. stderr follows: Error: syntax error in line 1
... <HR/> ...
in label of graph cluster_0
My Graphviz version is:
dot - graphviz version 2.36.0 (20140111.2315)
On the Graphviz web site, the page called "Node Shapes" contains a grammar (about half-way down) for HTML-like labels.
For <HR/>, it says:
rows : row
| rows row
| rows <HR/> row
This means that <HR/> is only allowed in between two rows. And rows are only allowed within a <TABLE>, so you'll have to wrap everything in a table and then it may work.
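A minimal sketch of that wrapping (illustrative and untested; the exact table attributes are assumptions, not from the original answer):
```
label=< <TABLE BORDER="0" CELLBORDER="0">
<TR><TD><B>process #1</B></TD></TR>
<HR/>
<TR><TD> </TD></TR>
</TABLE> >
```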
Depending on what exactly you'd like to achieve, another possible solution might be to simply underline the label using <U>text</U>.

Closest value in different files, with different numbers of lines and other conditions (bash, awk, other)

I have to revive an old question, with a modification for long files.
I have the ages of two stars in two files (File1 and File2). The age of the star is in column $1, and the rest of the columns up to $13 are information that I need to print at the end.
I am trying to find an age at which the stars have the same age, or the closest age. Since the files are large (~25000 lines), I don't want to search the whole array, for speed reasons.
Also, they could have a big difference in the number of lines (let's say ~10000 in some cases).
I am not sure if this is the best way to solve the problem, but for lack of a better one, this is my idea. (If you have a faster and more efficient method, please share it.)
All the values have 12 decimals of precision. For now I am only concerned with the first column (where the age is).
And I need different loops.
Let's use this value from file 1:
2.326062371284e+05
First the routine should search file2 for all the matches that contain
2.3260e+05
(This loop will probably search the whole array, but if there is a way to stop the search as soon as it reaches 2.3261 then it will save some time.)
If it finds just one, then the output should be that value.
Usually it will find several lines, maybe even up to 1000. If this is the case, it should search again against
2.32606e+05
among the lines found before. (It is a nested loop, I think.)
Then the number of matches will decrease to ~200.
At that moment, the routine should search for the best difference, within a certain tolerance X, between
2.326062371284e+05
and all these 200 lines.
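(For reference, since both files are sorted by age, this progressive narrowing amounts to a nearest-value lookup in a sorted list. A minimal Python sketch with bisect, assuming files named file1 and file2 with the age in column 1:)
```python
import bisect

# Read the sorted ages (column 1) of file2 once.
ages2 = sorted(float(line.split()[0]) for line in open("file2"))

def nearest(value, tolerance=3000.0):
    """Return the closest age in file2, or None if outside the tolerance."""
    i = bisect.bisect_left(ages2, value)
    candidates = ages2[max(0, i - 1):i + 1]  # neighbours of the insertion point
    best = min(candidates, key=lambda a: abs(a - value))
    return best if abs(best - value) <= tolerance else None

for line in open("file1"):
    age1 = float(line.split()[0])
    match = nearest(age1)
    if match is not None:
        print(age1, match)
```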
This way, having these files:
File1
1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1
File2
2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2
Output File3 (with tolerance 3000)
2.210729203510e+05 2.210729203510e+05 col2f1 col2f2 col4f1 col3f2
2.326062371284e+05 2.354895663228e+05 col2f1 col2f2 col4f1 col3f2
Important condition:
The output shouldn't contain repeated lines (star 1 can't have, at a fixed age, different ages for star 2, just the closest one).
How would you solve this?
Super thanks!
PS: I've completely changed the question, since it was shown to me that my reasoning had some errors. Thanks!
Not an awk solution; there comes a time when other solutions are great too, so here is an answer using R.
New answer with different data, not reading from a file this time, to bake an example:
# Sample data for the code; use fread to read from a file and setnames to name the columns accordingly
library(data.table)  # needed for data.table(), setkey() and the join below
set.seed(123)
data <- data.table(age=runif(20)*1e6,name=sample(state.name,20),sat=sample(mtcars$cyl,20),dens=sample(DNase$density,20))
data2 <- data.table(age=runif(10)*1e6,name=sample(state.name,10),sat=sample(mtcars$cyl,10),dens=sample(DNase$density,10))
setkey(data,'age') # Set the key for joining to the age column
setkey(data2,'age') # Set the key for joining to the age column
# get the result
result=data[ # To get the whole datas from file 1 and file 2 at end
data2[
data, # Search for each star of list 1
.SD, # return columns of file 2
roll='nearest',by=.EACHI, # Join on each line (left join) and find nearest value
.SDcols=c('age','name','dens')]
][!duplicated(age) & abs(i.age - age) < 1e3,.SD,.SDcols=c('age','i.age','name','i.name','dens','i.dens') ] # filter duplicates in first file and on difference
# Write results to a file (change the separator as you wish):
write.table(format(result,digits=15,scientific=TRUE),"c:/test.txt",sep=" ")
Code:
# A nice package to have; install.packages('data.table') if it's not present
library(data.table)
# Read the data (the text can be file names)
stars1 <- fread("1.833800650355e+05
1.959443501406e+05
2.085086352458e+05
2.210729203510e+05
2.326062371284e+05
2.441395539059e+05
2.556728706833e+05")
stars2 <- fread("2.210729203510e+05
2.354895663228e+05
2.499062122946e+05
2.643228582664e+05
2.787395042382e+05
2.921130362004e+05
3.054865681626e+05")
# Name the columns (not needed if the file has a header)
colnames(stars1) <- "age"
colnames(stars2) <- "age"
# Key the data tables (for a fast join with binary search later)
setkey(stars1,'age')
setkey(stars2,'age')
# Get the result (more details below on what is happening here :))
result=stars2[ stars1, age, roll="nearest", by=.EACHI]
# Rename the columns so we can filter the whole result
setnames(result,make.unique(names(result)))
# Final filter on difference
result[abs(age.1 - age) < 3e3]
So the interesting part is the first 'join' on the two star age lists, searching for each age in stars1 the nearest one in stars2.
This gives (after column renaming):
> result
age age.1
1: 183380.1 221072.9
2: 195944.4 221072.9
3: 208508.6 221072.9
4: 221072.9 221072.9
5: 232606.2 235489.6
6: 244139.6 249906.2
7: 255672.9 249906.2
Now we have the nearest for each; filter those close enough (absolute difference below 3000 here):
> result[abs(age.1 - age) < 3e3]
age age.1
1: 221072.9 221072.9
2: 232606.2 235489.6
Perl to the rescue. This should be very fast, as it does a binary search in the given range.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use List::Util qw{ max min };
use constant { SIZE => 100,
TOLERANCE => 3000,
};
my @times2;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
chomp;
push @times2, $_;
}
my $num = 0;
open my $F1, '<', 'file1' or die $!;
while (my $time = <$F1>) {
chomp $time;
my $from = max(0, $num - SIZE);
my $to = min($#times2, $num + SIZE);
my $between;
while (1) {
$between = int(($from + $to) / 2);
if ($time < $times2[$between] && $to != $between) {
$to = $between;
} elsif ($time > $times2[$between] && $from != $between) {
$from = $between;
} else {
last
}
}
$num++;
if ($from != $to) {
my $f = $time - $times2[$from];
my $t = $times2[$to] - $time;
$between = ($f > $t) ? $to : $from;
}
say "$time $times2[$between]" if TOLERANCE >= abs $times2[$between] - $time;
}

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it to the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time thing, it should be sufficient.
import re

dt = re.compile('^<DT>')
dd = re.compile('^<DD>')

current_dt = ""  # initialise so the final flush below cannot fail
current_dd = ""
with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            # collect the #tags from the <DD> line
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            # inject TAGS="tag1,tag2" into the <A ...> of the stored <DT> line
            current_dt = re.sub('(<A[^>]+)>',
                                '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">',
                                current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':  # drop now-empty descriptions
                current_dd = ""
        else:
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""
print(current_dt)
print(current_dd)
If some parts of the code are not clear, just tell me. You can of course use Python to write the lines to a file instead of printing them, or even modify the original file.
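For example, a short sketch of writing collected lines to a file instead (the output file name is hypothetical):
```python
# Suppose the loop appended its output to a list instead of printing it.
lines_out = ["<DT>Phabricator", "<DD>"]  # whatever the loop produced
with open("pinboard.html", "w") as out:  # hypothetical output file name
    out.write("\n".join(lines_out) + "\n")
```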
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{FS="#"}
/^<DT>/{
if(d==1) print "<DT>"s # for printing lines with no tags
s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
d=1
}
/^<DD>/{
d=0
m=match(s,/>/) # Find the end of the HREF descriptor (the first match of ">")
for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
if(td==0){ # No description exists
tags=substr(tags,2)
tagdes=""
}
else{ # Description exists
tagdes=substr(tags,td)
tags=substr(tags,2,td-2)
}
print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT:
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland

Speed up the analysis

I have two dataframes in R, for example df and dfrefseq.
df<-data.frame( chr = c("chr1","chr1","chr1","chr4")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("HTA","OdX","FEA","MGA")
)
dfrefseq<-data.frame( chr = c("chr1","chr1","chr1","chr2")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("tra","FGE","FFs","FAA")
)
I want to check, for each gene in df, which gene in dfrefseq lies closest to the selected df gene.
I first selected "chr1" in both dataframes.
Then I calculated, for the first gene in readschr1, the distance between the start-start, start-stop, stop-start and stop-stop sites.
The sum of these calculations says everything about the distance. My question here is: how can I speed up this analysis? Right now I tested only one gene against a dataframe, but I need to test 2000 genes.
readschr1 <- subset(df,df[,1]=="chr1")
refseqchr1 <- subset(dfrefseq,dfrefseq[,1]=="chr1")
names<-list()
read_start_start<-list()
read_start_stop<-list()
read_stop_start<-list()
read_stop_stop<-list()
for (i in 1:nrow(refseqchr1)) {
startstart<-abs(readschr1[1,2] - refseqchr1[i,2])
startstop<-abs(readschr1[1,2] - refseqchr1[i,3])
stopstart<-abs(readschr1[1,3] - refseqchr1[i,2])
stopstop<-abs(readschr1[1,3] - refseqchr1[i,3])
read_start_start[[i]]<- matrix(startstart)
read_start_stop[[i]]<- matrix(startstop)
read_stop_start[[i]]<- matrix(stopstart)
read_stop_stop[[i]]<- matrix(stopstop)
names[[i]]<-matrix(refseqchr1[i,4])
}
table<-cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)
sumtotalcolumns<-as.numeric(table[,2]) + as.numeric(table[,3])+ as.numeric(table[,4]) + as.numeric(table[,5])
test<-cbind(table, sumtotalcolumns)
test1<-test[order(as.vector(test$sumtotalcolumns)), ]
Thank you!
The Bioconductor package GenomicRanges is designed to work with this type of data:
source('http://bioconductor.org/biocLite.R')
biocLite('GenomicRanges') # one-time installation
then
library(GenomicRanges)
gr <- with(df,
GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
IRanges(start, stop), genenames=genenames))
grrefseq <- with(dfrefseq,
GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
IRanges(start, stop), genenames=genenames))
and
> nearest(gr, grrefseq)
[1] 1 2 3 NA
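The returned values are row indices into grrefseq, so the nearest genes themselves can be pulled out by subsetting (a sketch; the NA for chr4, which has no counterpart in dfrefseq, must be dropped first):
```r
hits <- nearest(gr, grrefseq)
grrefseq[hits[!is.na(hits)]]  # nearest dfrefseq gene for each df gene that has one
```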
You can merge the two separate data.frames together to form one table and then use vectorized operations. The key to merge is to specify the common column(s) between the data.frames and to tell it what to do when there are cases that do not match. Specifying all = TRUE will return all rows and fill in NAs if there is no match in the other data.frame, i.e. chr2 and chr4 in this case. Once the data.frames have been merged, it's a simple exercise in subtracting the different columns from one another and then summing the four columns of interest. I use transform to cut down on the typing needed to do the subtraction.
zz <- merge(df, dfrefseq, by = "chr", all = TRUE)
zz <- transform(zz,
read_start_start = abs(start.x - start.y)
, read_start_stop = abs(start.x - stop.y)
, read_stop_start = abs(stop.x - start.y)
, read_stop_stop = abs(stop.x - stop.y)
)
zz <- transform(zz,
sum_total_columns = read_start_start + read_start_stop + read_stop_start + read_stop_stop
)
Here's one approach to get the row with the minimum distance. I'm assuming you want to do this by chr and genenames. I use the plyr package, but I'm sure there are base solutions if you'd prefer one of those. Maybe someone else will chime in with a base solution.
require(plyr)
ddply(zz, c("chr", "genenames.x"), function(x) x[which.min(x$sum_total_columns) ,])
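For completeness, one possible base-R equivalent of that ddply call (a sketch, untested, assuming the zz columns created above):
```r
do.call(rbind, lapply(split(zz, list(zz$chr, zz$genenames.x), drop = TRUE),
                      function(x) x[which.min(x$sum_total_columns), ]))
```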
