How to read text files with Graphviz?

For example, I have put the weights of the edges in the txt file below:
0 1 2
1 0 3
2 3 0
The desired result is:
{
node0--node1[weight=1];
node0--node2[weight=2];
node1--node2[weight=3];
}

Graphviz understands text files written in the DOT language, not DSV-like formats such as the one in your example.
So you need to convert your file to DOT notation first; how best to do that is a topic for another question. Still, as a proof of concept, and to illustrate the point (already made in the comments to your question) that Graphviz may not be the best tool for visualizing 50 000 000 edges in a single image (you could ask for a clustering application on the Software Recommendations site), I will show an example that generates a text file with random weights for ~20 000 edges between 200 nodes, then reads that file, converts it to DOT (using the graphviz 0.19.1 package), and produces output in the .dot/.gv, .canon or .png formats. The whole program ran in under 5 minutes.
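Just to address the literal question first: converting the exact 3-line matrix above into the desired DOT text takes only a few lines of plain Python. This is a minimal sketch (no graphviz package needed; it assumes the matrix is saved as data.txt, and it prints a complete DOT graph, i.e. with the graph keyword added):
with open('data.txt') as f:
    rows = [line.split() for line in f if line.strip()]
print('graph {')
for i, row in enumerate(rows):
    for j in range(i + 1, len(row)):      # upper triangle only (the matrix is symmetric)
        if row[j] != '0':                 # a zero weight means no edge
            print('node%d--node%d[weight=%s];' % (i, j, row[j]))
print('}')
The larger proof-of-concept script (random weights, ~20 000 edges) follows.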
Python 3.9.1 script:
import graphviz as gv
import random

# Count of nodes:
size = 200
# Range of edges weights (upper limit):
weights_range = 10

# Generate txt file with random weights:
arr = [['0' for x in range(size)] for y in range(size)]
col = 1
for row in range(0, size):
    i = col
    while i < size:
        arr[row][i] = str(random.randrange(weights_range))
        arr[i][row] = arr[row][i]
        i += 1
    col += 1
with open('data.txt', 'w+') as file:
    for row in range(0, len(arr)):
        file.write(' '.join(arr[row]) + '\n')

# Create empty graph object:
Gr = gv.Graph(format='canon',        # For PNG output change to 'png'.
              filename='graph.gv',
              engine='neato',        # Layout engine.
              graph_attr=dict(splines='true',
                              sep='5',
                              overlap='scale'),
              node_attr=dict(shape='circle',
                             margin='0',
                             fontsize='10',
                             width='0.2',
                             height='0.2',
                             fixedsize='true'),
              edge_attr=dict(arrowsize='0.4'))

# Read data from file and add nodes/edges to graph object:
file = open('data.txt', 'r')
row = 0
col_shift = 1  # To shift the beginning of each line to the right,
               # so as not to go over the diagonal line of zeros.
for line in file:
    line = line.strip()
    if len(line) == 0:
        continue
    weights = line.split(' ')
    col = col_shift
    while col < len(weights):
        # Variant 1. Long names for nodes:
        # Gr.edge('node' + str(row), 'node' + str(col), weight=weights[col])
        # Variant 2. Short names for nodes:
        Gr.edge(str(row), str(col), weight=weights[col])
        col += 1
    row += 1
    col_shift += 1
file.close()

Gr.render()  # Rendering graph to text file or image.
Result in .canon format (19 916 lines, ~500 KB; short node names like 42 are used instead of node42 to keep the file small):
graph {
    graph [overlap=scale,
           sep=5,
           splines=true
    ];
    node [fixedsize=true,
          fontsize=10,
          height=0.2,
          label="\N",
          margin=0,
          shape=circle,
          width=0.2
    ];
    edge [arrowsize=0.4];
    0 -- 1  [weight=8];
    0 -- 2  [weight=1];
    0 -- 3  [weight=5];
    0 -- 4  [weight=2];
    0 -- 5  [weight=7];
    ...
    196 -- 198  [weight=0];
    196 -- 199  [weight=3];
    197 -- 198  [weight=0];
    197 -- 199  [weight=5];
    198 -- 199  [weight=9];
}
Result PNG (~20 000 edges between 200 nodes, ~5 MB):
P.S. If you do want to create such an image, the following questions may help you tune the graph settings for faster rendering and a clearer picture. They are optional and not directly related to this question, so I give their numbers instead of links (to avoid SO auto-linking them here):
GraphViz Dot very long duration of generation
/questions/10766100
laying out a large graph with graphviz
/questions/13417411
Graphviz is too slow
/questions/64130276
Is it possible to generate a small GraphViz chart?
/questions/68321537
Compacting a digraph in Graphviz using Dot Language
/questions/8610710
Reducing the size (as in area) of the graph generated by graphviz
/questions/3428448

Related

Finding group sizes in matrices

So I was wondering: is there an easy way to detect the sizes of groups of adjacent equal values in a matrix? For example, when looking at the matrix of values between 0 and 12 below:
The size of the group at [0,4] is 14, because there are 14 5s connected to each other, but the 1 and 4 are not connected.
I think you can use a breadth-first search (well, kind of; try to visualize the matrix as a tree).
Here's a pseudo-Python implementation that does this. Would this work for you? Did you have a complexity in mind?
Code
visited_nodes = set()

def inside_matrix(cell_row, cell_column):
    return 0 <= cell_row < len(matrix) and 0 <= cell_column < len(matrix[0])

def find_adjacent_vals(target_val, cell_row, cell_column):
    if not inside_matrix(cell_row, cell_column):
        return 0
    cell = (cell_row, cell_column)      # identify a cell by its coordinates
    if cell in visited_nodes:
        return 0
    visited_nodes.add(cell)
    if matrix[cell_row][cell_column] != target_val:
        return 0
    return (1
            + find_adjacent_vals(target_val, cell_row + 1, cell_column)   # below
            + find_adjacent_vals(target_val, cell_row - 1, cell_column)   # above
            + find_adjacent_vals(target_val, cell_row, cell_column - 1)   # left
            + find_adjacent_vals(target_val, cell_row, cell_column + 1))  # right

# target_val, target_row and target_column describe the query cell:
print("Adjacent values count: " + str(find_adjacent_vals(target_val, target_row, target_column)))
Explanation
Let's say you start at a node; you branch out, visiting nodes you haven't visited before, and you keep going until you encounter no new cells of the same value. Each node is guaranteed to have only one parent node thanks to the set logic, so no cell is double counted.
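A quick usage sketch against the code above, with a small made-up example matrix (not from the question):
matrix = [
    [5, 5, 1, 4],
    [5, 5, 5, 4],
    [2, 5, 5, 4],
]
target_val, target_row, target_column = 5, 0, 0
visited_nodes.clear()
print("Adjacent values count: " + str(find_adjacent_vals(target_val, target_row, target_column)))
# -> Adjacent values count: 7   (the block of 5s touching [0, 0])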

MATLAB - Permutations of random indices in specific areas of a grid

I have a problem in which I have 4 objects (1s) on a 100x100 grid of zeros that is split up into 16 even squares of 25x25.
I need to create a (16^4 * 4) table whose entries list all the possible positions of each of these 4 objects across the 16 submatrices. The objects can be anywhere within the submatrices as long as they aren't overlapping one another. This is clearly a permutation problem, but there is added complexity because of the indexing and the fact that the positions need to be random but not overlapping within a given 16th square. Would love any pointers!
What I tried to do was create a function called "top_left_corner(position)" that returns the subscript of the top left corner of the sub-matrix you are in. E.g. top_left_corner(1) = (1,1), top_left_corner(2) = (26,1), etc. Then I have:
pos = randsample(24,2);
I = pos(1)+top_left_corner(position,1);
J = pos(2)+top_left_corner(position,2);
The problem is how to generate and store permutations of this in a table as linear indices.
First, the Cartesian product is generated with ndgrid in the form of a [4, 16^4] matrix perm. Then, in the while loop, random numbers are generated and added to perm. If any column of perm contains duplicated random numbers, the random number generation is repeated for those columns until no column has duplicated elements; normally no more than 2-3 iterations are needed. Since the [100, 100] array is divided into 16 blocks, kron is used to generate an index pattern matching the 16 blocks, and the sort function extracts the indices of the sorted elements. The generated random numbers then form indices into that pattern (the 16 blocks).
C = cell(1,4);
[C{:}] = ndgrid(0:15,0:15,0:15,0:15);
perm = reshape([C{:}],16^4,4).';
perm_rnd = zeros(size(perm));
c = 1:size(perm,2);
while true
    perm_rnd(:,c) = perm(:,c) * 625 + randi(625,4,numel(c));
    [~,c0] = find(diff(sort(perm_rnd(:,c),1),1,1)==0);
    if isempty(c0)
        break;
    end
    %c = c(unique(c0));
    c = c([true ; diff(c0)~=0]);
end
pattern = kron(reshape(1:16,4,4),ones(25));
[~,idx] = sort(pattern(:));
result = idx(perm_rnd).';

problem with power.roc.test in R

I am analysing several different ROC analyses in my article. Therefore, I am investigating whether my sample size is appropriate. I have created a data frame which contains all combinations of possible sample sizes for the ROC analysis.
str(auc)
'data.frame': 93 obs. of 2 variables:
$ cases : int 10 11 12 13 14 15 16 17 18 19 ...
$ controls: int 102 101 100 99 98 97 96 95 94 93 ...
My aim is to create a line plot of cases/controls (i.e. kappa) versus the optimal AUC.
Hence I would like to create a third variable, using power.roc.test to calculate the optimal AUC.
I ran into the problem below; where does the problem lie?
auc$auc<-power.roc.test(sig.level=.05,power=.8,ncases=auc$cases,ncontrols=auc$controls)$auc
Error in value[[3L]](cond) : AUC could not be solved:
Error in uniroot(power.roc.test.optimize.auc.function, interval = c(0.5, : invalid function value in 'zeroin'
In addition: Warning messages:
1: In if (is.na(f.lower)) stop("f.lower = f(lower) is NA") :
the condition has length > 1 and only the first element will be used
2: In if (is.na(f.upper)) stop("f.upper = f(upper) is NA") :
the condition has length > 1 and only the first element will be used
3: In if (f.lower * f.upper > 0) stop("f() values at end points not of opposite sign") :
the condition has length > 1 and only the first element will be used
I believe you are using the pROC package. The error message is not especially helpful here, but you basically need to pass scalar values, including to ncases and ncontrols.
power.roc.test(sig.level=.05,power=.8,ncases=10, ncontrols=102)
You can wrap that in some apply loop:
auc$auc <- apply(auc, 1, function(line) {
    power.roc.test(sig.level=.05, power=.8, ncases=line[["cases"]], ncontrols=line[["controls"]])$auc
})
Then you will be able to plot this however you want:
plot(auc$cases / auc$controls, auc$auc, type = "l")
Note that the AUC here is not the "optimal AUC", but the AUC at which you can expect the given power, at the given significance level, with the given sample size, for a test of the significance of the AUC (H0: AUC = 0.5). Note also that you won't be able to perform this test with pROC anyway.

gnuplot: label x and y-axis of matrix (heatmap) with row and column names

I'm absolutely new to gnuplot and did not find a working solution after googling.
I have a data matrix looking something like this:
  A B C D E
A 0 2 3 4 5
B 6 0 8 9 0
C 1 2 0 4 5
D 6 7 8 0 0
E 1 2 3 4 0
What I would like to do is plot a heatmap with plot 'result.csv' matrix with image, with the x-axis labeled A-E and the y-axis labeled A-E. This matrix does not always have the same size, but the number of rows always equals the number of columns, and it is always labeled.
Could someone help me out with this? I guess I have to use the using command, but up to now this appears like complete voodoo to me...
Thanks a lot!
This one is actually a doozy, and I can't make it happen without some *nix shell magic.
First, we want to get the x tic labels and y tic labels:
XTICS="`awk 'BEGIN{getline}{printf "%s ",$1}' test.dat`"
YTICS="`head -1 test.dat`"
At this point, XTICS and YTICS are both the string "A B C D E" (for the data shown above: the row labels and the column labels, respectively).
Now, we want to set the xtics by iteration:
set for [i=1:words(XTICS)] xtics ( word(XTICS,i) i-1 )
set for [i=1:words(YTICS)] ytics ( word(YTICS,i) i-1 )
We've used 2 gnuplot builtin functions (word and words). words(string) counts how many words there are in the given string (a word is a character sequence separated by spaces). word(string,n) returns the n'th word in the string.
Now we can plot your datafile. The only problem is that matrix wants to use all rows and columns in your datafile. You might be able to cut down the rows/columns actually read by using the every keyword, but I don't know how to do that for matrix files, and I think it's probably easier to just keep relying on shell utilities (awk and sed):
plot "<awk '{$1=\"\"}1' test.dat | sed '1 d'" matrix w image
#######^ replace the first field with nothing
################################## ^ delete first line
And now your plot (hopefully) looks the way you want it to.
Also note that since we have used iteration, this script will only work in gnuplot 4.3 or higher. Since the current stable release is 4.6, hopefully that's OK.
OK, this question is old, but there is a much easier way, already implemented in gnuplot:
plot 'result.csv' matrix rowheaders columnheaders with image
I got it from http://gnuplot.sourceforge.net/demo/heatmaps.html

Efficient method to get one number, which can't be generated from any XORing combination

If there is any number in the range [0 .. 2^64] which cannot be generated by any XOR combination of one or more numbers from a given set, is there an efficient method that prints at least one of the unreachable numbers, or terminates with the information that there are no unreachable numbers?
Does this problem have a name? Is it similar to another problem, or do you have any idea how to solve it?
Each number can be treated as a vector in the vector space (Z/2)^64 over Z/2. You basically want to know whether the given vectors span the whole space and, if not, to produce one that is not spanned (except that the span always includes the zero vector; you'll have to special-case this if you really want combinations of one or more elements). This can be accomplished via Gaussian elimination.
Over this particular vector space, Gaussian elimination is pretty simple. Start with an empty set for the basis. Do the following until there are no more numbers: (1) throw away all of the numbers that are zero; (2) among the remaining numbers, look at each one's lowest set bit (the lowest set bit of x is x & ~(x - 1)) and choose a number whose lowest set bit is of the lowest order; (3) put it in the basis; (4) update all of the other numbers that have that same bit set by XORing them with the new basis element. No remaining number has this bit or any lower-order bit set, so we terminate after at most 64 iterations.
At the end, if there are 64 basis elements, then the span is everything. Otherwise, we ran fewer than 64 iterations and skipped a bit: the number with only that bit set is not spanned.
To special-case zero: zero is an option (as an unreachable value) if and only if we never throw away a number, i.e., the input vectors are linearly independent.
Example over 4-bit numbers
Start with 0110, 0011, 1001, 1010. Choose 0011 because it has the ones bit set. Basis is now {0011}. Other vectors are {0110, 1010, 1010}; note that the first 1010 = 1001 XOR 0011.
Choose 0110 because it has the twos bit set. Basis is now {0011, 0110}. Other vectors are {1100, 1100}.
Choose 1100. Basis is now {0011, 0110, 1100}. Other vectors are {0000}.
Throw away 0000. We're done. We skipped the high order bit, so 1000 is not in the span.
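For reference, here is a short Python sketch of the elimination described above (an illustration only, not part of the original answer; the function name unreachable_value is just a placeholder, and the zero special case is deliberately left out):
def unreachable_value(numbers, bits=64):
    # Returns None if every value in [0, 2**bits) is an XOR of some subset
    # of `numbers`; otherwise returns one value that is not reachable.
    # (The zero special case discussed above is not handled here.)
    basis, pivots = [], set()
    rest = [n for n in numbers if n]          # step (1): drop zeros
    while rest:
        x = min(rest, key=lambda v: v & -v)   # step (2): lowest low-order set bit
        bit = (x & -x).bit_length() - 1
        basis.append(x)                       # step (3)
        pivots.add(bit)
        # step (4): clear that bit from everything else, then drop zeros
        rest = [r ^ x if (r >> bit) & 1 else r for r in rest if r != x]
        rest = [r for r in rest if r]
    if len(basis) == bits:
        return None                           # the whole space is spanned
    missing = next(b for b in range(bits) if b not in pivots)
    return 1 << missing

# The 4-bit example above:
print(bin(unreachable_value([0b0110, 0b0011, 0b1001, 0b1010], bits=4)))  # 0b1000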
As rap music points out, you can think of the problem as finding a basis of a vector space. However, it is not necessary to actually solve it completely, just to find out whether it is possible or not, and if not, to give an example value (that is, a binary vector) that cannot be described in terms of the supplied set.
This can be done in O(n^2) in terms of the size of the input set. Compare this to Gaussian elimination, which is O(n^3): http://en.wikipedia.org/wiki/Gaussian_elimination.
64 bits are no problem at all. With the example Python code below, 1000 bits with a set of 1000 random values from 0 to 2^1000-1 takes about a second.
Instead of performing Gaussian elimination, it's enough to find out whether we can rewrite the matrix of all the bits in triangular form, such as (for the 4-bit version):
original      triangular
1110  14      1110  14
1011  11       111   7
 111   7        11   3
  11   3         1   1
   1   1         0   0
The solution works like this. First, all original values with the same most significant bit are placed together in a list of lists. For our example:
[[14,11],[7],[3],[1],[]]
The last (empty) entry indicates that there were no zeros in the original list. Now, take a value from the first entry and replace that entry with a list containing only that number:
[[14],[7],[3],[1],[]]
and then store the XOR of the kept number with each of the removed entries at the right place in the vector. In our case we have 14^11 = 5, so:
[[14],[7,5],[3],[1],[]]
The trick is that we do not need to scan and update all other values, just the values with the same most significant bit.
Now process the entry [7, 5] in the same way: keep 7, and add 7^5 = 2 to the list:
[[14],[7],[3,2],[1],[]]
Now [3, 2] leaves [3] and adds 1:
[[14],[7],[3],[1,1],[]]
And [1, 1] leaves [1] and adds 0 to the last entry, allowing values with no set bit:
[[14],[7],[3],[1],[0]]
If, in the end, the vector contains at least one number at each entry (as in our example), the base is complete and any number can be represented.
Here's the complete code:
# Return the leading bit index, or -1 for 0.
# example 1 -> 0
# example 9 -> 3
def leadbit(v):
    # there are other ways, yes...
    return len(bin(v)) - 3 if v else -1

def examinebits(baselist, nbitbuckets):
    # index 1 is the least significant bit.
    # index 0 represents the value 0
    bitbuckets = [[] for x in range(nbitbuckets + 1)]
    for j in baselist:
        bitbuckets[leadbit(j) + 1].append(j)
    for i in reversed(range(len(bitbuckets))):
        if bitbuckets[i]:
            # leave just the first value of all in bucket i
            bitbuckets[i], newb = [bitbuckets[i][0]], bitbuckets[i][1:]
            # distribute the subleading values into their buckets
            for ni in newb:
                q = bitbuckets[i][0] ^ ni
                lb = leadbit(q) + 1
                if lb:
                    bitbuckets[lb].append(q)
                else:
                    bitbuckets[0] = [0]
        else:
            v = 2**(i - 1) if i else 0
            print("bit missing: %d. Impossible value: %s == %d" % (i - 1, bin(v), v))
            return (bitbuckets, [i])
    return (bitbuckets, [])
Example use: (8 bit)
import random
nbits=8
basesize=8
topval=int(2**nbits)
# random set of values to try:
basel=[random.randint(0,topval-1) for dummy in range(basesize)]
bl,ii=examinebits(basel,nbits)
bl is now the triangular list of values, up to the point where it was not possible to go further (in that case). The missing bit (if any) can be derived from ii[0]: the bit index is ii[0] - 1.
For the following set of trial values, [242, 242, 199, 197, 177, 177, 133, 36], the triangular version is:
base value: 10110001 177
base value: 1110110 118
base value: 100100 36
base value: 10000 16
first missing bit: 3 val: 8
( the below values were not completely processed )
base value: 10 2
base value: 1 1
base value: 0 0
The above list was printed like this:
for i in range(len(bl)):
    bb = bl[len(bl) - i - 1]
    if ii and len(bl) - ii[0] == i:
        print("example missing bit:", (ii[0] - 1), "val:", 2**(ii[0] - 1))
        print("( the below values were not completely processed )")
    if len(bb):
        b = bb[0]
        print(("base value: %" + str(nbits) + "s") % (bin(b)[2:]), b)
