Increasing efficiency on a network randomization algorithm on Julia - for-loop

My problem is the following. I have the adjacency matrix Mat for a neural network. I want to randomize this network in the sense that I want to choose 4 nodes randomly (say i, j, p, q) such that i and p are connected (which means Mat[p,i] = 1) and j and q are connected (Mat[q,j] = 1), AND i and q are not connected (Mat[q,i] = 0) and j and p are not connected (Mat[p,j] = 0). I then connect i and q and j and p, and disconnect the previously connected pairs. In one run, I want to do this 10^6 times.
So far I have two versions, one using a for loop and one recursively.
newmat = copy(Mat)
for trial in 1:Niter
    count = 0
    while count < 1
        i,j,p,q = sample(Nodes,4,replace = false) #Choosing 4 nodes at random
        if (newmat[p,i] == 1 && newmat[q,j] == 1) && (newmat[p,j] == 0 && newmat[q,i] == 0)
            newmat[p,i] = 0
            newmat[q,j] = 0
            newmat[p,j] = 1
            newmat[q,i] = 1
            count += 1
        end
    end
end
Doing this recursively runs about just as fast until Niter = 10^4 after which I get a Stack Overflow error. How can I improve this?

I assume you are talking about a recursive variant of the for trial in 1:Niter.
To avoid stack overflows like this, a general rule of thumb (in languages without tail recursion elimination) is to not use recursion unless you know the recursion depth will not scale more than logarithmically.
The cases where this is applicable are mostly algorithms that resemble tree traversals, with a "naturally occurring" recursive structure. Your case of a simple for loop can be viewed as the degenerate variant of that, with a "linked list" tree, but is not at all natural.
Just don't do it. There's nothing bad about a loop for some sequential processing like this. Julia is an imperative language, after all.
(If you want to do this with a recursive structure for fun or exercise: look up trampolines. They allow you to write code structured as tail recursive, but with the allocation happening by mutation and on the heap.)
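To make the trampoline idea concrete, here is a minimal sketch in Python rather than Julia. The names trampoline and count_down are made up for illustration. The function is written in tail-recursive style but returns a zero-argument closure instead of calling itself, so the driver loop keeps the call stack at constant depth.

```python
def trampoline(step, *args):
    """Call `step`, then keep calling whatever it returns until the result is not callable."""
    result = step(*args)
    while callable(result):
        result = result()
    return result


def count_down(n, acc=0):
    # Tail-recursive in structure: instead of recursing directly, return a
    # thunk (a zero-argument closure allocated on the heap) for the next step.
    if n == 0:
        return acc
    return lambda: count_down(n - 1, acc + 1)


# 100_000 steps would overflow the default recursion limit if written as
# plain recursion; through the trampoline the stack depth stays constant.
total = trampoline(count_down, 100_000)
print(total)
```

The same pattern would apply to the randomization loop, with the trial counter as the argument, but a plain for loop remains simpler.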

Instead of sampling 4 random nodes and hoping they happen to be connected, you can sample the starting nodes p and q, and look for i and j within the nodes that these are connected to. Here's an implementation of that:
using StatsBase  # provides sample

function randomizeconnections(adjmatin)
    adjmat = copy(adjmatin)
    nodes = axes(adjmat, 2)
    niter = 10
    for trial in 1:niter
        p, q = sample(nodes, 2, replace = false)
        @views pall, qall = findall(adjmat[p, :]), findall(adjmat[q, :])
        # filter against the unfiltered neighbour lists so that shared targets are excluded from both
        plist = filter(i -> !in(i, qall) && i != q, pall)
        qlist = filter(j -> !in(j, pall) && j != p, qall)
        if isempty(plist) || isempty(qlist)
            @debug "No swappable exclusive target nodes for source nodes $p and $q, skipping trial $trial..."
            continue
        end
        i = rand(plist)
        j = rand(qlist)
        adjmat[p, i] = adjmat[q, j] = false
        adjmat[p, j] = adjmat[q, i] = true
    end
    adjmat
end
Through the course of randomization, it may happen that two nodes don't have any swappable connections i.e. they may share all their end points or one's ending nodes are a subset of the other's. So there's a check for that in the above code, and the loop moves on to the next iteration in that case.
The line with the findalls in the above code effectively creates adjacency lists from the adjacency matrix on the fly. You can instead do that in one go at the beginning, and work with that adjacency list vector instead.
function randomizeconnections2(adjmatin)
    adjlist = [findall(r) for r in eachrow(adjmatin)]
    nodes = axes(adjlist, 1)
    niter = 10
    for trial in 1:niter
        p, q = sample(nodes, 2, replace = false)
        plist = filter(i -> !in(i, adjlist[q]) && i != q, adjlist[p])
        qlist = filter(j -> !in(j, adjlist[p]) && j != p, adjlist[q])
        if isempty(plist) || isempty(qlist)
            @debug "No swappable exclusive target nodes for source nodes $p and $q, skipping trial $trial..."
            continue
        end
        i = rand(plist)
        j = rand(qlist)
        replace!(adjlist[p], i => j)
        replace!(adjlist[q], j => i)
    end
    create_adjmat(adjlist)
end

function create_adjmat(adjlist::Vector{Vector{Int}})
    adjmat = falses(length(adjlist), length(adjlist))
    for (i, l) in pairs(adjlist)
        adjmat[i, l] .= true
    end
    adjmat
end
With the small matrices I tried locally, randomizeconnections2 seems about twice as fast as randomizeconnections, but you may want to confirm whether that's the case with your matrix sizes and values.
Both of these accept (and were tested with) BitMatrix type values as input, which should be more efficient than an ordinary matrix of booleans or integers.

Related

Hoshen-Kopelman Algorithm MatLab Implementation

First of all, I'm not 100% sure if Matlab is allowed on stack overflow; if I'm violating the rules tell me and I'll delete the question.
I'm currently studying percolation on a 2D surface (using a matrix for it), and I've written an algorithm that is able to check for percolation and label all clusters in a grid. The algorithm I wrote isn't very efficient and I've been tasked to implement the HK algorithm to do the same, more efficiently, since HK doesn't "look behind".
FYI -> reticolo stands for grid, reticle
The "main" part of the program seems to work fine. Whenever I used my own way instead of the Find function, I also found that in a random grid where there aren't any cells that have BOTH up and left NN, everything works fine and the whole grid gets labelled correctly.
That also means that technically the Union-Find part of the algorithm isn't really needed if the generated grid has no cells with both a left and an upper neighbor, but that is clearly unrealistic.
What seems to be giving me problems is implementing the Find function correctly.
If anyone is able to help, I'd appreciate it. Thanks
function A = HK(p,L)
    if nargin == 0,0;
        p = 0.55;
        L = 4;
    end
    A.reticolo = rand(L)<p; % matrix field with prob p
    A.label = zeros(L); % label field
    A.prob = p; % prob field
    idx = reshape(1:L^2, L, L);
    nnl = [zeros(L,1) idx(:,1:L-1)]; % find near neighbor (NN) on the left
    nnu = [zeros(1,L); idx(1:L-1,:)]; % find NN up
    largest_label = 1;
    for i = 1:L^2
        left = i-L;
        up = i-1;
        if(A.reticolo(i) && ~A.label(i)) %If the site is coloured and has no label
            %if it has NN-left and NN-up
            if (nnu(i) && nnl(i) && A.reticolo(nnu(i)) && A.reticolo(nnl(i))) %#ok<*ALIGN>
                % A.label(i) = min(A.label(i-L), A.label(i-1)); -> i tried
                % a workaround that didn't really work
                Union(left, up);
            % If there's a NNL
            elseif (nnl(i) && A.reticolo(nnl(i)))
                A.label(i) = A.label(nnl(i)); %-> this is what i'm using
                % since Find isn't working. This works for up and left cases.
                % A.label(i) = Find(left); -> I should be using this instead
            % If there's a NNU
            elseif (nnu(i) && A.reticolo(nnu(i)))
                A.label(i) = A.label(nnu(i));
                %A.label(i) = Find(up) -> again, i should be using this instead of the line above
            % If it doesn't have neighbours at all
            else,
                largest_label = largest_label +1;
                A.label(i) = largest_label;
            end
        end
    end
    function F = Find(x)
        while A.label(x) ~= x %here there seems to be a problem, that i don' get
            x = A.label(x);
        end
        F = x;
    end
    function Union(x,y)
        A.label(Find(x)) = Find(y);
    end
end
And the errors I'm getting
Array indices must be positive integers or logical values.
Error in HK/Find (line 56)
while A.label(x) ~= x
Error in HK/Union (line 63)
A.label(Find(x)) = Find(y);
Error in HK (line 32)
Union(left, up);
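One likely reading of this error: A.label stores a cluster label per site, but Find follows those values as if they were site indices, so it can land on an unlabelled site (value 0) and then index position 0. A common way to structure Hoshen-Kopelman is to keep a separate parent array indexed by label, not by site. The sketch below shows that structure in Python rather than MATLAB. The names parent, find, union and hoshen_kopelman are illustrative and not taken from the code above.

```python
def find(parent, x):
    """Return the root label of label x, shortening the chain along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(parent, x, y):
    """Merge the clusters of labels x and y and return the surviving root."""
    rx, ry = find(parent, x), find(parent, y)
    if rx != ry:
        parent[max(rx, ry)] = min(rx, ry)
    return min(rx, ry)


def hoshen_kopelman(grid):
    """Label 4-connected clusters of truthy cells; 0 marks an empty cell."""
    rows, cols = len(grid), len(grid[0])
    label = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k] belongs to label k; label 0 is never used
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue
            up = label[r - 1][c] if r > 0 else 0
            left = label[r][c - 1] if c > 0 else 0
            if up and left:
                label[r][c] = union(parent, up, left)
            elif up or left:
                label[r][c] = find(parent, up or left)
            else:
                parent.append(len(parent))  # a new label starts as its own root
                label[r][c] = len(parent) - 1
    # Second pass: replace each provisional label by its root.
    for r in range(rows):
        for c in range(cols):
            if label[r][c]:
                label[r][c] = find(parent, label[r][c])
    return label
```

Because parent is only ever indexed by labels that have been created, find can never reach index 0.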

How to implement simulated annealing to find longest path in graph

I've found a piece of pseudocode which explains simulated annealing for longest path problem, but there are a few details which I do not understand.
Currently I have implemented a structure representing graph, and method to generate random graph and random path in the graph - both uniform.
Here's the pseudocode of simulated annealing:
Procedure Anneal(G, s, t, P)
    P = RandomPath(s, t, G)
    temp = TEMP0
    itermax = ITER0
    while temp > TEMPF do
        while iteration < itermax do
            S = RandomNeighbor(P, G)
            delta = S.len - P.len
            if delta > 0 then
                P = S
            else
                x = random01
                if x < exp(delta / temp) then
                    P = S
                endif
            endif
            iteration = iteration + 1
        enddo
        temp = Alpha(temp)
        itermax = Beta(itermax)
    enddo
The details which I do not find clear enough to understand are:
RandomNeighbor(P, G)
Alpha(temp)
itermax = Beta(itermax)
What are these methods supposed to do?
RandomNeighbor(P, G): This is probably the function that creates a new solution (or new neighboring solution) from your current solution (the neighbor is chosen randomly).
Alpha(temp): That's the function that reduces the temperature (probably temp *= alpha)
itermax = Beta(itermax): I can only assume that this one is changing (most probably, resetting) the counter on iterations since it's being used on the inner while. So, when your counter for iteration reaches its max, it's reset.
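To show how these three pieces might fit together, here is a minimal generic sketch in Python. It assumes geometric cooling for Alpha (temp times a constant below 1) and a multiplicative rule for Beta. It also assumes the inner iteration counter restarts at each temperature, which the pseudocode does not state. The parameter names and the toy problem are illustrative only; for the longest-path case, random_neighbor would return a slightly modified path and score its length.

```python
import math
import random


def anneal(initial, score, random_neighbor, temp0=10.0, temp_final=0.01,
           iter0=100, alpha=0.95, beta=1.0, rng=random):
    """Maximize score(state) with simulated annealing and return the final state."""
    current = initial
    temp, itermax = temp0, iter0
    while temp > temp_final:
        for _ in range(int(itermax)):  # assumed: counter restarts at each temperature
            candidate = random_neighbor(current, rng)
            delta = score(candidate) - score(current)
            # Improvements are always accepted; a worse candidate is accepted
            # with probability exp(delta / temp), which shrinks as temp falls.
            if delta > 0 or rng.random() < math.exp(delta / temp):
                current = candidate
        temp *= alpha      # Alpha(temp): geometric cooling (assumption)
        itermax *= beta    # Beta(itermax): iterations per temperature stage (assumption)
    return current


# Toy usage: walk on the integers, looking for the maximum of -(x - 3)^2.
result = anneal(
    initial=0,
    score=lambda x: -(x - 3) ** 2,
    random_neighbor=lambda x, rng: x + rng.choice((-1, 1)),
    rng=random.Random(0),
)
print(result)
```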

Homework: Implementing Karp-Rabin; For the hash values modulo q, explain why it is a bad idea to use q as a power of 2?

I have a two-fold homework problem, Implement Karp-Rabin and run it on a test file and the second part:
For the hash values modulo q, explain why it is a bad idea to use q as a power of 2. Can you construct a terrible example e.g. for q=64
and n=15?
This is my implementation of the algorithm:
def karp_rabin(text, pattern):
    # setup
    alphabet = 'ACGT'
    d = len(alphabet)
    n = len(pattern)
    d_n = d**n
    q = 2**32-1
    m = {char:i for i,char in enumerate(alphabet)}
    positions = []

    def kr_hash(s):
        return sum(d**(n-i-1) * m[s[i]] for i in range(n))

    def update_hash():
        return d*text_hash + m[text[i+n-1]] - d_n * m[text[i-1]]

    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
            positions.append(i)
    return ' '.join(map(str, positions))
...The second part of the question is referring to this part of the code/algo:
pattern_hash = kr_hash(pattern)
for i in range(0, len(text) - n + 1):
    text_hash = update_hash() if i else kr_hash(text[i:n])
    # the modulo q used to check if the hashes are congruent
    if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
        positions.append(i)
I don't understand why it would be a bad idea to use q as a power of 2. I've tried running the algorithm on the test file provided(which is the genome of ecoli) and there's no discernible difference.
I tried looking at the formula for how the hash is derived (I'm not good at math) trying to find some common factors that would be really bad for powers of two but found nothing. I feel like if q is a power of 2 it should cause a lot of clashes for the hashes so you'd need to compare strings a lot more but I didn't find anything along those lines either.
I'd really appreciate help on this since I'm stumped. If someone wants to point out what I can do better in the first part (code efficiency, readability, correctness etc.) I'd also be thrilled to hear your input on that.
There is a problem if q divides some power of d, because then only a few characters contribute to the hash. For example in your code d=4, if you take q=64 only the last three characters determine the hash (d**3 = 64).
I don't really see a problem if q is a power of 2 but gcd(d,q) = 1.
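A small sketch of that effect, using the question's alphabet with d = 4, q = 64 and n = 15. The helper poly_hash and the two strings are made up for illustration. Every character except the last three is multiplied by a power of 4 that is a multiple of 64, so it vanishes modulo q.

```python
ALPHABET = "ACGT"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def poly_hash(s, d=4, q=64):
    """Polynomial hash of s in base d, reduced modulo q after every character."""
    h = 0
    for ch in s:
        h = (h * d + CODE[ch]) % q
    return h


a = "ACGTACGTACGTGCA"  # n = 15
b = "TTTTTTTTTTTTGCA"  # different string, same last three characters

# With q = 64 both hashes depend only on "GCA": 2*16 + 1*4 + 0 = 36.
print(poly_hash(a), poly_hash(b))
```

So any two windows of length 15 that end in the same three characters collide, and the full string comparison runs for all of them.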
Your implementation looks a bit strange because instead of
if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
you could also use
if pattern_hash == text_hash and pattern == text[i:i+n]:
which would be better because you get fewer collisions.
The Thue–Morse sequence has among its properties that its polynomial hash quickly becomes zero when the hash modulus is a power of 2, whatever the polynomial base (d). So if you try to search for a short Thue–Morse sequence in a longer one, you will get a great many hash collisions.
For example, your code, slightly adapted:
def karp_rabin(text, pattern):
    # setup
    alphabet = '01'
    d = 15
    n = len(pattern)
    d_n = d**n
    q = 32
    m = {char:i for i,char in enumerate(alphabet)}
    positions = []

    def kr_hash(s):
        return sum(d**(n-i-1) * m[s[i]] for i in range(n))

    def update_hash():
        return d*text_hash + m[text[i+n-1]] - d_n * m[text[i-1]]

    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q : #and pattern == text[i:i+n]:
            positions.append(i)
    return ' '.join(map(str, positions))

print(karp_rabin('0110100110010110100101100110100110010110011010010110100110010110', '0110100110010110'))
outputs a lot of positions, although only three of them are proper matches.
Note that I have dropped the and pattern == text[i:i+n] check. Obviously if you restore it, the result will be correct, but it is also obvious that the algorithm will do much more work checking this additional condition than it would for other q. In fact, because there are so many collisions, the whole idea of the algorithm breaks down: you could almost as effectively write a simple algorithm that checks every position for a match.
Also note that your implementation is quite strange. The whole idea of polynomial hashing is to take the modulo operation each time you compute the hash. Otherwise your pattern_hash and text_hash are very big numbers. In other languages this might mean arithmetic overflow, but in Python this will invoke big-integer arithmetic, which is slow and once again defeats the whole idea of the algorithm.
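A minimal sketch of what taking the modulo at every step could look like. The function name karp_rabin_mod and the default q = 2**31 - 1 (a prime) are choices made for this illustration, not part of the question's code.

```python
def karp_rabin_mod(text, pattern, alphabet="ACGT", q=2**31 - 1):
    """Return the start indices of pattern in text, keeping every hash below q."""
    d = len(alphabet)
    code = {ch: i for i, ch in enumerate(alphabet)}
    n = len(pattern)
    if n == 0 or n > len(text):
        return []
    high = pow(d, n - 1, q)  # weight of the leading character, already reduced
    p_hash = t_hash = 0
    for k in range(n):
        p_hash = (p_hash * d + code[pattern[k]]) % q
        t_hash = (t_hash * d + code[text[k]]) % q
    positions = []
    for i in range(len(text) - n + 1):
        # Compare strings only when the reduced hashes agree.
        if p_hash == t_hash and text[i:i + n] == pattern:
            positions.append(i)
        if i + n < len(text):
            # Slide the window: drop text[i], append text[i + n].
            t_hash = ((t_hash - code[text[i]] * high) * d + code[text[i + n]]) % q
    return positions


print(karp_rabin_mod("ACGTACGT", "CGT"))
```

Python's % always returns a non-negative result for a positive modulus, so the subtraction in the update needs no extra correction here. In languages where the remainder can be negative, q would have to be added back.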

Fast way of checking if an element is ranked higher than another

I am writing in MATLAB a program that checks whether two elements A and B were exchanged in ranking positions.
Example
Assume the first ranking is:
list1 = [1 2 3 4]
while the second one is:
list2 = [1 2 4 3]
I want to check whether A = 3 and B = 4 have exchanged relative positions in the rankings, which in this case is true, since in the first ranking 3 comes before 4 and in the second ranking 3 comes after 4.
Procedure
In order to do this, I have written the following MATLAB code:
positionA1 = find(list1 == A);
positionB1 = find(list1 == B);
positionA2 = find(list2 == A);
positionB2 = find(list2 == B);
if (positionA1 <= positionB1 && positionA2 >= positionB2) || ...
(positionA1 >= positionB1 && positionA2 <= positionB2)
... do something
end
Unfortunately, I need to run this code a lot of times, and the find function is really slow (but needed to get the element position in the list).
I was wondering if there is a way of speeding up the procedure. I have also tried to write a MEX file that performs in C the find operation, but it did not help.
If the lists don't change within your loop, then you can determine the positions of the items ahead of time.
Assuming that your items are always integers from 1 to N:
[~, positions_1] = sort( list1 );
[~, positions_2] = sort( list2 );
This way you won't need to call find within the loop, you can just do:
positionA1 = positions_1(A);
positionB1 = positions_1(B);
positionA2 = positions_2(A);
positionB2 = positions_2(B);
If your loop is going over all possible combinations of A and B, then you can also vectorize that
Find the elements that exchanged relative ranking:
rank_diff_1 = bsxfun(@minus, positions_1, positions_1');
rank_diff_2 = bsxfun(@minus, positions_2, positions_2');
rel_rank_changed = sign(rank_diff_1) ~= sign(rank_diff_2);
[A_changed, B_changed] = find(rel_rank_changed);
Optional: Throw out half of the results, because if (3,4) is in the list, then (4,3) also will be, and maybe you don't want that:
mask = (A_changed < B_changed);
A_changed = A_changed(mask);
B_changed = B_changed(mask);
Now loop over only those elements that have exchanged relative ranking
for ii = 1:length(A_changed)
A = A_changed(ii);
B = B_changed(ii);
% Do something...
end
Instead of find try to compute something like this
Check if there is any exchanged values.
if logical(sum(abs(list1-list2)))
do something
end;
For specific values A and B:
if (list1(logical((list1-list2)-abs((list1-list2))))==A)&&(list1(logical((list1-list2)+abs((list1-list2))))==B)
do something
end;

How to easily know if a maze has a road from start to goal?

I implemented a maze using a 0/1 array. The entry and goal are fixed in the maze: the entry is always the point (0,0) and the goal is always the point (m-1,n-1). I'm using a breadth-first search algorithm for now, but the speed is not good enough, especially for a large maze (100*100 or so). Could someone help me with this algorithm?
Here is my solution:
queue = []
position = start_node
mark_tried(position)
queue << position
while(!queue.empty?)
  p = queue.shift #pop the first element
  return true if maze.goal?(p)
  left = p.left
  visit(queue,left) if can_visit?(maze,left)
  right = p.right
  visit(queue,right) if can_visit?(maze,right)
  up = p.up
  visit(queue,up) if can_visit?(maze,up)
  down = p.down
  visit(queue,down) if can_visit?(maze,down)
end
return false
The can_visit? method checks whether the node is inside the maze, whether it has already been visited, and whether it is blocked.
Worst answer possible:
1) Go forward until you can't move.
2) Turn left.
3) Rinse and repeat.
If you make it out, there is an end.
A better solution:
Traverse your maze keeping two lists, for open and closed nodes. Use the well-known A-Star algorithm to evaluate and choose the next node, and discard nodes that are dead ends. If you run out of nodes on your open list, there is no exit.
Here is a simple algorithm which should be much faster:
From start/goal move to the first junction. You can ignore anything between that junction and the start/goal.
Locate all places in the maze which are dead ends (they have three walls). Move back to the next junction and take this path out of the search tree.
After you have removed all dead ends this way, there should be a single path left (or several if there are several ways to reach the goal).
I would not use the AStar algorithm there yet, unless I really need to, because this can be done with some simple 'coloring'.
# maze is an m x n array; cells != 0 are treated as passable
def canBeTraversed(maze):
    m = len(maze)
    n = len(maze[0])
    colored = [ [ False for i in range(0,n) ] for j in range(0,m) ]
    open = [(0,0),]
    while len(open) != 0:
        (x,y) = open.pop()
        # lower bounds are checked too: negative indices would wrap around in Python
        if x == m-1 and y == n-1:
            return True
        elif 0 <= x < m and 0 <= y < n and maze[x][y] != 0 and not colored[x][y]:
            colored[x][y] = True
            open.extend([(x-1,y), (x,y-1), (x+1,y), (x,y+1)])
    return False
Yes, it's simplistic: a plain uninformed flood fill (depth-first as written, since pop() takes from the end of the list).
Here is the A* implementation
def dist(x,y):
    # ** is exponentiation; ^ would be bitwise XOR in Python
    return (abs(x[0]-y[0]) + abs(x[1]-y[1]))**2

def heuristic(x,y):
    return (x[0]-y[0])**2 + (x[1]-y[1])**2

def find(open,f):
    # linear scan for the open node with the smallest f value
    result = None
    min = None
    for x in open:
        tmp = f[x[0]][x[1]]
        if min == None or tmp < min:
            min = tmp
            result = x
    return result

def neighbors(x,m,n):
    def add(result,y,m,n):
        # keep only neighbours that lie inside the grid
        if 0 <= y[0] < m and 0 <= y[1] < n: result.append(y)
    result = []
    add(result, (x[0]-1,x[1]), m, n)
    add(result, (x[0],x[1]-1), m, n)
    add(result, (x[0]+1,x[1]), m, n)
    add(result, (x[0],x[1]+1), m, n)
    return result

def canBeTraversedAStar(maze):
    m = len(maze)
    n = len(maze[0])
    goal = (m-1,n-1)
    closed = set([])
    open = set([(0,0),])
    g = [ [ 0 for y in range(0,n) ] for x in range(0,m) ]
    h = [ [ heuristic((x,y),goal) for y in range(0,n) ] for x in range(0,m) ]
    f = [ [ h[x][y] for y in range(0,n) ] for x in range(0,m) ]
    while len(open) != 0:
        x = find(open,f)
        if x == (m-1,n-1):
            return True
        open.remove(x)
        closed.add(x)
        for y in neighbors(x,m,n):
            if y in closed: continue
            if maze[y[0]][y[1]] == 0: continue  # blocked cell
            if y not in open:
                open.add(y)
                g[y[0]][y[1]] = g[x[0]][x[1]] + dist(x,y)
                h[y[0]][y[1]] = heuristic(y,goal)
                f[y[0]][y[1]] = g[y[0]][y[1]] + h[y[0]][y[1]]
    # open list exhausted without reaching the goal
    return False
Here is my (simple) benchmark code:
import datetime

def tryIt(func,size, runs):
    maze = [ [ 1 for i in range(0,size) ] for j in range(0,size) ]
    begin = datetime.datetime.now()
    for i in range(0,runs): func(maze)
    end = datetime.datetime.now()
    print(size, 'x', size, ':', (end - begin) / runs, 'average on', runs, 'runs')

tryIt(canBeTraversed,100,100)
tryIt(canBeTraversed,1000,100)
tryIt(canBeTraversedAStar,100,100)
tryIt(canBeTraversedAStar,1000,100)
Which outputs:
# For canBeTraversed
100 x 100 : 0:00:00.002650 average on 100 runs
1000 x 1000 : 0:00:00.198440 average on 100 runs
# For canBeTraversedAStar
100 x 100 : 0:00:00.016100 average on 100 runs
1000 x 1000 : 0:00:01.679220 average on 100 runs
The obvious conclusion here: getting A* to run smoothly requires a lot of optimizations I did not bother to go after...
I would say:
Don't optimize
(Expert only) Don't optimize yet
How much time are you talking about when you say too much? Really, a 100x100 grid is so easily parsed by brute force it's a joke :/
I would have solved this with an AStar implementation. If you want even more speed, you can optimize to only generate the nodes from the junctions rather than every tile/square/step.
A method you can use that does not need to visit all nodes in the maze is as follows:
create an integer[][] with one value per maze "room"
create a queue, add [startpoint, count=1, delta=1] and [goal, count=-1, delta=-1]
start coloring the route by:
popping an object from the head of the queue, put the count at the maze point.
check all reachable rooms for a count with sign opposite to that of the room's delta; if you find one, the maze is solved: run both ways and connect the routes with the biggest steps up and down in room counts.
otherwise add all reachable rooms that have no count to the tail of the queue, with delta added to the room count.
if the queue is empty no path through the maze is possible.
This not only determines if there is a path, but also shows the shortest path possible through the maze.
You don't need to backtrack, so it's O(number of maze rooms).
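A simplified sketch of this two-sided search in Python. It only answers whether a path exists and does not reconstruct the route, so it stores which side reached each room instead of a signed step count. It assumes truthy cells are open and 0 is a wall; the function name has_path is made up for the example.

```python
from collections import deque


def has_path(maze):
    """Search from the start and the goal at once; stop when the two fronts touch."""
    m, n = len(maze), len(maze[0])
    start, goal = (0, 0), (m - 1, n - 1)
    if not maze[0][0] or not maze[m - 1][n - 1]:
        return False
    if start == goal:
        return True
    # side[cell] is +1 if the cell was reached from the start, -1 if from the goal.
    side = {start: 1, goal: -1}
    queue = deque([start, goal])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if not (0 <= nx < m and 0 <= ny < n) or not maze[nx][ny]:
                continue  # outside the grid or a wall
            seen = side.get((nx, ny))
            if seen is None:
                side[(nx, ny)] = side[(x, y)]
                queue.append((nx, ny))
            elif seen != side[(x, y)]:
                return True  # a room reached from the other side: the fronts have met
    return False


print(has_path([[1, 1, 0],
                [0, 1, 0],
                [0, 1, 1]]))
```

Each room enters the queue at most once, which matches the O(number of maze rooms) bound, and deque.popleft() keeps each dequeue at constant cost.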

Resources