Performance issues building graph with py2neo transactions

I'm trying to build a full binary tree of variable depth using py2neo. I've been trying to use py2neo transactions to send CREATE statements to the server, but the runtime is awful.
Building a tree with depth 8 (255 nodes) takes about 16.7 seconds, and the vast majority of that time is spent while the transaction is committing (if I call Transaction.process() before committing, the processing step takes the bulk of the runtime instead). What could be the issue? Each Cypher statement is just a single MATCH plus a node-and-relationship CREATE.
Here's the function that builds the tree:
# Assumes an import along the lines of: from timeit import default_timer as timer
def buildBinaryTree(self):
    depth = 1
    tx = self.graph.begin()
    g = self.graph
    leaves = []
    leaves.append("Root")
    tx.run('CREATE(n {ruleName: "Root"})')
    timeSum = 0
    while depth < self.scale:
        newLeaves = []
        for leaf in leaves:
            leftName = leaf + 'L'
            rightName = leaf + 'R'
            newLeaves.append(leftName)
            newLeaves.append(rightName)
            start = timer()
            statement = ('MATCH(n {ruleName: "' + leaf + '"}) '
                         'WITH n CREATE (n)-[:`LINKS TO`]'
                         '->({ruleName: "' + leftName + '"})')
            tx.run(statement)
            statement = ('MATCH(n {ruleName: "' + leaf + '"}) '
                         'WITH n CREATE (n)-[:`LINKS TO`]'
                         '->(m {ruleName: "' + rightName + '"})')
            tx.run(statement)
            end = timer()
            timeSum += (end - start)
        leaves = newLeaves
        depth += 1
        print("Depth = " + str(depth))
    print(timeSum)
    start = timer()
    print("Processing...")
    tx.process()
    print(timer() - start)
    print("Committing...")
    tx.commit()
    print("Committed")
    print(timer() - start)
And the output with scale = 8:
building tree...
Depth = 2
Depth = 3
Depth = 4
Depth = 5
Depth = 6
Depth = 7
Depth = 8
0.009257960999775605
Processing...
16.753949095999815
Committing...
Committed
17.28687257200022

Try making these changes to ensure that you are truly running in a single transaction and not unnecessarily delaying committing your updates:
Change tx.run to tx.append. That will enqueue each Cypher statement instead of immediately executing it.
Delete tx.process(), since you have no subsequent Cypher statements to enqueue.
Ideally, you should also be passing parameters instead of modifying your Cypher statements every time through the loop. However, that would not be responsible for the slowness you are experiencing.
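For illustration, here is a rough sketch of how the loop body might look with both changes applied. This assumes a py2neo version whose Transaction.append accepts a parameter dict, and the older {param} Cypher parameter syntax; the parameter names parentName and childName are made up for the example:

    statement = ('MATCH (n {ruleName: {parentName}}) '
                 'WITH n CREATE (n)-[:`LINKS TO`]'
                 '->({ruleName: {childName}})')
    for leaf in leaves:
        for childName in (leaf + 'L', leaf + 'R'):
            newLeaves.append(childName)
            tx.append(statement, {'parentName': leaf, 'childName': childName})

With parameters, the statement string is built once, and the server can compile the query plan a single time and reuse it for every execution.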

Related

Algorithm to encode Huffman tree

I have a program which encodes a sequence, i.e. it creates codewords using the Huffman method.
I need to encode the tree itself, where node = 0 and leaf = 1. It should be something like a binary heap, I guess, where the first element (0) says that it has 2 children, the next two elements (e.g., 00) also have two children each, the next four (10 00) have one leaf and 3 non-leaf children, and so on.
I have the expected result for the given sequence, but I have no idea how to produce it.
function [ ] = encodeTwoPassHuff( )
    global CODE
    global codeTree
    codeTree = [];
    clc;
    inputStr = 'IF_WE_CANNOT_DO_AS_WE_WOULD_WE_SHOULD_DO_AS_WE_CAN';
    a = unique(inputStr);
    N = size(inputStr, 2);
    Nx = zeros(1, size(a, 2));
    for i = 1:size(a, 2)
        for j = 1:N
            if (a(i) == inputStr(j))
                Nx(i) = Nx(i) + 1;
            end
        end
    end
    for i = 1:size(a, 2)
        prob(i) = Nx(i) / N;
    end
    CODE = cell(length(prob), 1);
    p = prob;
    s = cell(length(p), 1);
    for i = 1:length(p)
        s{i} = i;
    end
    while size(s, 1) > 2
        [p, i] = sort(p, 'ascend');
        p(2) = p(1) + p(2);
        p(1) = [];
        s = s(i);
        s{2} = {s{1}, s{2}};
        s(1) = [];
    end
    CODE = makecode(s, []);
    fprintf('00010000010100110111101101111\n'); % encoded tree (true)
    fprintf('%d', codeTree); % my result
    fprintf('\n');
    for i = 1:length(CODE)
        len(i) = length(CODE{i});
    end
    % print
    disp('symbol | probabil | len | codeword');
    for i = 1:length(prob)
        fprintf('%5s\t %.4f\t %3d\t %s\n', a(i), prob(i), len(i), num2str(CODE{i}));
    end
end

function [CODE] = makecode(ss, codeword)
    global CODE
    global codeTree
    if isa(ss, 'cell') % node
        codeTree = [codeTree 0];
        makecode(ss{1}, [codeword 1]);
        makecode(ss{2}, [codeword 0]);
    else % leaf
        CODE{ss} = char('0' + codeword);
        codeTree = [codeTree 1];
    end
end
Normally, you just encode the lengths of the code words for each symbol. For example, if you build your tree and you get
A -> 10
B -> 0
C -> 111
D -> 110
Then you just write out the lengths array, like [2,1,3,3].
Of course, there are lots of trees that produce the same code lengths, but it doesn't matter which one you use, since they are all equally efficient.
The sender and receiver do have to use the same tree, though, so after writing out the lengths, the sender builds a new tree from the lengths in exactly the same way that the receiver will, e.g.,
A -> 00
B -> 1
C -> 010
D -> 011
As long as the sender and receiver build the same tree, everything works fine, and you have avoided transmitting all the redundant information that would distinguish one equivalent tree from another.
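For a concrete picture of that rebuild step, here is a minimal Python sketch (not from the original answer) of the standard canonical-Huffman construction, which deterministically assigns codewords from the lengths alone; the exact convention does not matter, as long as the sender and receiver use the same one:

    def canonical_codes(lengths):
        # lengths: dict mapping symbol -> codeword length, e.g. {'A': 2, 'B': 1, 'C': 3, 'D': 3}
        codes = {}
        code = 0
        prev_len = 0
        # Visit symbols in order of (length, symbol); each codeword is the
        # previous one plus 1, left-shifted whenever the length increases.
        for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= (length - prev_len)
            codes[sym] = format(code, '0%db' % length)
            code += 1
            prev_len = length
        return codes

    print(canonical_codes({'A': 2, 'B': 1, 'C': 3, 'D': 3}))
    # {'B': '0', 'A': '10', 'C': '110', 'D': '111'}

Both sides run the same construction on the same lengths array, so they always agree on the tree.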

Why is my Julia shared array code running so slow?

I'm trying to implement Smith-Waterman alignment in parallel using Julia (see Figure 1 of http://www.cs.virginia.edu/~rl6sf/paper_dump/2011:12:33:22.pdf), but the algorithm is running much slower in Julia than the serial version. I'm using shared arrays to do this and figure I am doing something silly that is making the code run slow. Could someone take a look and see if my code is as optimized as possible? The parallel version should run faster than the serial one.
The basic concept is to compute the anti-diagonal elements of a matrix in parallel, from the upper-left to the lower-right corner, updating them as we go. I'm trying to use 32 cores on a shared-memory machine to do this. I have a SharedArray matrix that I am using, and I compute the elements of each anti-diagonal in parallel as shown below. The while loops in the spSW function submit tasks to workers in sync for each anti-diagonal, using the helper function shared_get_score!(). The main goal of this function is to fill in each element of the shared arrays matrix and path.
function spSW(seq1,seq2,p)
    indel = -1
    match = 2
    seq1 = "^$seq1"
    seq2 = "^$seq2"
    col = length(seq1)
    row = length(seq2)
    wl = workers()
    matrix,path = shared_initialize_path(seq1,seq2)
    for j = 2:col
        jcol = j
        irow = 2
        @sync begin
            count = 0
            while jcol > 1 && irow < row + 1
                # println(j," ",irow," ",jcol)
                if seq1[jcol] == seq2[irow]
                    equal = true
                else
                    equal = false
                end
                w = wl[(count % p) + 1]
                @async remotecall_wait(w,shared_get_score!,matrix,path,equal,indel,match,irow,jcol)
                jcol -= 1
                irow += 1
                count += 1
            end
        end
    end
    for i = 3:row
        jcol = col
        irow = i
        @sync begin
            count = 0
            while irow < row+1 && jcol > 1
                # println(i," ",irow," ",jcol)
                if seq1[jcol] == seq2[irow]
                    equal = true
                else
                    equal = false
                end
                w = wl[(count % p) + 1]
                @async remotecall_wait(w,shared_get_score!,matrix,path,equal,indel,match,irow,jcol)
                jcol -= 1
                irow += 1
                count += 1
            end
        end
    end
    return matrix,path
end
The other helper functions are:
function shared_initialize_path(seq1,seq2)
    col = length(seq1)
    row = length(seq2)
    matrix = convert(SharedArray,fill(0,(row,col)))
    path = convert(SharedArray,fill(0,(row,col)))
    return matrix,path
end

@everywhere function shared_get_score!(matrix,path,equal,indel,match,i,j)
    pathvalscode = ["-","|","M"]
    pathvals = [1,2,3]
    scores = []
    push!(scores,matrix[i,j-1]+indel)
    push!(scores,matrix[i-1,j]+indel)
    if equal
        push!(scores,matrix[i-1,j-1]+match)
    else
        push!(scores,matrix[i-1,j-1]+indel)
    end
    val,ind = findmax(scores)
    if val < 0
        matrix[i,j] = 0
    else
        matrix[i,j] = val
    end
    path[i,j] = pathvals[ind]
end
Does anyone see an obvious way to make this run faster? Right now it's about 10 times slower than the serial version.

Faster concatenation of cell arrays of different sizes

I have a cell array of size m x 1, and each cell is again an s x t cell array (the sizes vary). I would like to concatenate them vertically. The code is as follows:
function cell_out = vert_cat(cell_in)
    [row,col] = cellfun(@size,cell_in,'Uni',0);
    fcn_vert = @(x)([x,repmat({''},size(x,1),max(cell2mat(col))-size(x,2))]);
    cell_out = cellfun(fcn_vert,cell_in,'Uni',0); % Taking up lot of time
    cell_out = vertcat(cell_out{:});
end
Step 3 (the cellfun call) takes a lot of time. Is this the right way to do it, or is there another, faster way to achieve this?
cellfun has been found to be slower than loops (kind of old, but agrees with what I have seen).
In addition, repmat has also been a performance hit in the past (though that may be different now).
Try this two-loop code that aims to accomplish your task:
function cellOut = vert_cat(c)
    nElem = length(c);
    colPad = zeros(nElem,1);
    nRow = zeros(nElem,1);
    for k = 1:nElem
        [nRow(k),colPad(k)] = size(c{k});
    end
    colMax = max(colPad);
    colPad = colMax - colPad;
    cellOut = cell(sum(nRow),colMax);
    bottom = cumsum(nRow) - nRow + 1;
    top = bottom + nRow - 1;
    for k = 1:nElem
        cellOut(bottom(k):top(k),:) = [c{k},cell(nRow(k),colPad(k))];
    end
end
My test for this code was
A = rand(20,20);
A = mat2cell(A,ones(20,1),ones(20,1));
C = arrayfun(@(c) A(1:c,1:c),randi([1,15],1,5),'UniformOutput',false);
ccat = vert_cat(C);
I used this piece of code to generate data:
% generating some dummy data
m = 1000;
s = 100;
t = 100;
cell_in = cell(m,1);
for idx = 1:m
    cell_in{idx} = cell(randi(s),randi(t));
end
Applying some minor modifications, I was able to speed up the code by a factor of 5:
% Minor modifications of the original code
% use arrays instead of cells for row and col
[row,col] = cellfun(@size,cell_in);
% calculate max(col) once
tcol = max(col);
% use cell instead of repmat to generate an empty cell
fcn_vert = @(x)([x,cell(size(x,1),tcol-size(x,2))]);
cell_out = cellfun(fcn_vert,cell_in,'Uni',0); % still the expensive step
cell_out = vertcat(cell_out{:});
Simply using a for loop is even faster, because each piece of data is moved only once:
% new approach; basic idea: move each piece of data only once
[row,col] = cellfun(@size,cell_in);
trow = sum(row);
tcol = max(col);
r = 1;
cell_out2 = cell(trow,tcol);
for idx = 1:numel(cell_in)
    cell_out2(r:r+row(idx)-1,1:col(idx)) = cell_in{idx};
    r = r + row(idx);
end

How to improve graph centrality calculation parallel performance?

I am getting performance degradation after parallelizing code that calculates graph centrality. The graph is relatively large, with 100K vertices. The single-threaded application takes approximately 7 minutes. As recommended on the julialang site (http://julia.readthedocs.org/en/latest/manual/parallel-computing/#man-parallel-computing), I adapted the code and used the pmap API to parallelize the calculations. I started the calculation with 8 processes (julia -p 8 calc_centrality.jl). To my surprise, I got a 10-fold slowdown: the parallel version now takes more than an hour. I noticed that it takes several minutes for the parallel processes to initialize and start calculating, but even after all 8 CPUs are 100% busy with the julia app, the calculation is super slow.
Any suggestion on how to improve the parallel performance is appreciated.
calc_centrality.jl:
using Graphs
require("read_graph.jl")
require("centrality_mean.jl")

function main()
    file_name = "test_graph.csv"
    println("graph creation: ", file_name)
    g = create_generic_graph_from_file(file_name)
    println("num edges: ", num_edges(g))
    println("num vertices: ", num_vertices(g))
    data = cell(8)
    data[1] = {g, 1, 2500}
    data[2] = {g, 2501, 5000}
    data[3] = {g, 5001, 7500}
    data[4] = {g, 7501, 10000}
    data[5] = {g, 10001, 12500}
    data[6] = {g, 12501, 15000}
    data[7] = {g, 15001, 17500}
    data[8] = {g, 17501, 20000}
    cm = pmap(centrality_mean, data)
    println(cm)
end
println("Elapsed: ", @elapsed main(), "\n")
centrality_mean.jl:
using Graphs

function centrality_mean(gr, start_vertex)
    centrality_cnt = Dict()
    vertex_to_visit = Set()
    push!(vertex_to_visit, start_vertex)
    cnt = 0
    while !isempty(vertex_to_visit)
        next_vertex_set = Set()
        for vertex in vertex_to_visit
            if !haskey(centrality_cnt, vertex)
                centrality_cnt[vertex] = cnt
                for neigh in out_neighbors(vertex, gr)
                    push!(next_vertex_set, neigh)
                end
            end
        end
        cnt += 1
        vertex_to_visit = next_vertex_set
    end
    mean([ v for (k,v) in centrality_cnt ])
end

function centrality_mean(data::Array{})
    gr = data[1]
    v_start = data[2]
    v_end = data[3]
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    v = vertices(gr)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        if cnt%10 == 0
            println(cnt)
        end
        cm[cnt] = centrality_mean(gr, v[i])
    end
    return cm
end
I'm guessing this has nothing to do with parallelism. Your second centrality_mean method has no clue what type of objects gr, v_start, and v_end are. So it's going to have to use non-optimized, slow code for that "outer loop."
While there are several potential solutions, probably the easiest is to break up your function that receives the "command" from pmap:
function centrality_mean(data::Array{})
    gr = data[1]
    v_start = data[2]
    v_end = data[3]
    centrality_mean(gr, v_start, v_end)
end

function centrality_mean(gr, v_start, v_end)
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    v = vertices(gr)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        if cnt%10 == 0
            println(cnt)
        end
        cm[cnt] = centrality_mean(gr, v[i])
    end
    return cm
end
All this does is create a break, and gives julia a chance to optimize the second part (which contains the performance-critical loop) for the actual types of the inputs.
Below is the code with the @everywhere changes suggested in the comments by https://stackoverflow.com/users/1409374/rickhg12hs. That fixed the performance issue!
test_parallel_pmap.jl:
using Graphs
require("read_graph.jl")
require("centrality_mean.jl")

function main()
    @everywhere file_name = "test_data.csv"
    println("graph creation from: ", file_name)
    @everywhere data_graph = create_generic_graph_from_file(file_name)
    @everywhere data_graph_vertex = vertices(data_graph)
    println("num edges: ", num_edges(data_graph))
    println("num vertices: ", num_vertices(data_graph))
    range = cell(2)
    range[1] = {1, 25000}
    range[2] = {25001, 50000}
    cm = pmap(centrality_mean_pmap, range)
    for i = 1:length(cm)
        println(length(cm[i]))
    end
end
println("Elapsed: ", @elapsed main(), "\n")
centrality_mean.jl:
using Graphs

function centrality_mean(start_vertex::ExVertex)
    centrality_cnt = Dict{ExVertex, Int64}()
    vertex_to_visit = Set{ExVertex}()
    push!(vertex_to_visit, start_vertex)
    cnt = 0
    while !isempty(vertex_to_visit)
        next_vertex_set = Set()
        for vertex in vertex_to_visit
            if !haskey(centrality_cnt, vertex)
                centrality_cnt[vertex] = cnt
                for neigh in out_neighbors(vertex, data_graph)
                    push!(next_vertex_set, neigh)
                end
            end
        end
        cnt += 1
        vertex_to_visit = next_vertex_set
    end
    mean([ v for (k,v) in centrality_cnt ])
end

function centrality_mean(v_start::Int64, v_end::Int64)
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        cm[cnt] = centrality_mean(data_graph_vertex[i])
    end
    return cm
end

function centrality_mean_pmap(range::Array{})
    v_start = range[1]
    v_end = range[2]
    centrality_mean(v_start, v_end)
end
From the Julia page on parallel computing:
Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processes in separate memory domains at once.
If I interpret this right, Julia's parallelism requires message passing to synchronize the processes. If each individual process does only a little work and then does a message-pass, the computation will be dominated by message-passing overhead rather than by useful work.
I can't see from your code, and I don't know Julia well enough, where the parallelism boundaries are. But you have a big, complicated graph that may be spread willy-nilly across multiple processes. If the processes need to interact while walking across graph links, you'll have exactly that kind of overhead.
You may be able to fix it by precomputing a partition of the graph into roughly equal-size, highly cohesive regions. I suspect that breakup requires the same type of complex graph processing that you already want to do, so you may have a chicken-and-egg problem to boot.
It may be that Julia offers you the wrong parallelism model. You might want a shared address space so that threads walking the graph don't have to use messages to traverse arcs.
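To make the granularity point concrete, here is a minimal Python sketch of the same effect (an illustration of the general principle, not the poster's Julia setup; centrality_stub is a made-up stand-in for the per-vertex work). Dispatching one tiny task per message is dominated by inter-process overhead, while handing each worker one large chunk amortizes it:

    from multiprocessing import Pool

    def centrality_stub(v):
        return v * v  # stand-in for a cheap per-vertex computation

    def centrality_chunk(chunk):
        return [centrality_stub(v) for v in chunk]

    if __name__ == '__main__':
        vertices = list(range(100000))
        with Pool(8) as pool:
            # Fine-grained: one task per vertex; per-message overhead dominates.
            fine = pool.map(centrality_stub, vertices, chunksize=1)
            # Coarse-grained: 8 big chunks; each message carries real work.
            chunks = [vertices[i::8] for i in range(8)]
            coarse = pool.map(centrality_chunk, chunks)

The successful @everywhere fix above follows the same logic: the graph lives on every worker, so each message only has to carry a range of vertex indices.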

speeding up some for loops in matlab

Basically I am trying to solve a second-order differential equation with the forward Euler method. I have some for loops inside my code, which take considerable time to solve, and I would like to speed things up a bit. Does anyone have any suggestions for how I could do this?
Also, when looking at the time it takes, I notice that my end at line 14 takes 45% of my total time. What is end actually doing, and why is it taking so much time?
Here is my simplified code:
t = 0:0.01:100;
dt = t(2) - t(1);
B = 3.5 * t;
F0 = 2 * t;
BB = zeros(1,length(t)); % Preallocation
x = 2; % Initial value
u = 0; % Initial value
for ii = 1:length(t)
    for kk = 1:ii
        BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
    end % This end takes 45% of the other time
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
Running the code takes me 8.552 seconds.
You can remove the inner loop, I think:
for ii = 1:length(t)
    for kk = 1:ii
        BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
    end % This end takes 45% of the other time
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
So BB(ii) = BB(ii) (zero at initialisation) + the sum over kk = 1:ii of B(kk) * u(ii-kk+1) * dt,
but kk = 1:ii, so for a given ii, ii-kk+1 → ii-(1:ii)+1 → ii:-1:1.
So I think this is equivalent to:
for ii = 1:length(t)
    BB(ii) = sum(B(1:ii).*u(ii:-1:1)*dt);
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
It doesn't take as long as 8 seconds for me using either method, but the version with only one loop is about 2x as fast (the output of BB appears to be the same).
Is the sum loop of B(kk) * u(ii-kk+1) just conv(B(1:ii),u(1:ii),'same')?
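Almost: the indexing works out to element ii of the full convolution of B(1:ii) and u(1:ii), rather than the 'same' output. A quick check of the equivalence, sketched in Python/NumPy with made-up sample values:

    import numpy as np

    # Made-up sample values standing in for B(1:ii) and u(1:ii).
    B = np.array([3.5, 7.0, 10.5, 14.0])
    u = np.array([0.0, 0.2, 0.5, 0.9])
    dt = 0.01
    ii = len(B)

    # Inner-loop form: BB(ii) = sum over kk = 1:ii of B(kk) * u(ii-kk+1) * dt,
    # i.e. sum over k of B[k] * u[ii-1-k] * dt with 0-based indexing.
    bb_loop = sum(B[k] * u[ii - 1 - k] for k in range(ii)) * dt

    # Convolution form: the same sum is element ii (0-based index ii-1)
    # of the full convolution of B and u.
    bb_conv = np.convolve(B, u)[ii - 1] * dt

    print(bb_loop, bb_conv)  # both print 0.0875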
The best way to speed up loops in MATLAB is to try to avoid them. See if you can perform a matrix operation instead of the inner loop. For example, try to break the calculation you do there into small parts, then decide whether there are parts you can compute in advance, without knowing the results of the next iteration of the loop.
As to the second part of your question, my guess: the end contains the check of whether the loop runs for another round. That check by itself is not expensive, but it is called 50,015,001 times: t has 10001 elements, and the inner loop body runs ii times on outer iteration ii, so the total is 1 + 2 + … + 10001 = 50,015,001.
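A one-line sanity check of that count, in Python:

    n = 10001                    # number of samples in t = 0:0.01:100
    print(sum(range(1, n + 1)))  # 50015001, i.e. n*(n+1)/2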
