Directed Cyclic Probability Graph - Probability of All Possible Paths

Consider the directed probability graph with 4 vertices (0, 1, 2, 3), represented by the following adjacency matrix P:
[1/3, 1/3, 0, 1/3]
[1/3, 1/3, 1/3, 0]
[ 0, 0, 0, 0]
[ 0, 0, 0, 0]
The edges represent transition probabilities between vertices. The edges are [(0,0), (0,1), (0,3), (1,0), (1,1), (1,2)] each with transition probability 1/3. There are two self loops (0,0) and (1,1), and a cycle created by edges (0,1) and (1,0).
For such graphs (and possibly bigger and more complex ones), with self loops and cycles, and hence a possibly infinite number of possible paths, how does one go about calculating the total probability of all possible cycles that start at vertex 0 and end at vertex 0?
I've calculated this for 3-vertex graphs using a geometric series. For a 3-vertex graph it comes out to:
P(0,2) * [ P(2,0) + P(1,2)*P(2,0) ] + P(0,1) * [ P(1,0) + P(1,2)*P(2,0) ]
-------------------------------------------------------------------------
[ 1 - P(1,2) * P(2,1) ]
Which is basically the sum of the probabilities of the simple paths, divided by a factor that corrects for the loop formed by edges (1,2) and (2,1).
With this calculation, the results for 3-vertex graphs tally with the results of the problem I'm trying to solve. I'm not sure how to scale this to bigger graphs.
PS: This is in continuation of this question and the accepted answer, in which a parameter Pr(Cii(0)|G) needs to be calculated.

The parameter Pr(Cii(0)|G) that you are computing can be treated as a special case of the general Pr(Cif(0)|G), for which an algorithm is given in the answer you linked.
The case of a loop can be handled by exactly the same code, since the removal of 'successor' paths at each step prevents the loop from continuing.
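For larger graphs, one standard way to generalize the geometric-series trick (a sketch of the linear-algebra view, not necessarily the exact method of the linked answer) is to note that the sum, over all walks from i to j, of the products of edge probabilities is the (i, j) entry of I + P + P^2 + ... = (I - P)^(-1). The series converges whenever the spectral radius of P is below 1, which holds here because probability leaks out of the system at vertices 2 and 3. A minimal numpy sketch for the 4-vertex example:

import numpy as np

# Transition matrix from the question.
P = np.array([
    [1/3, 1/3, 0.0, 1/3],
    [1/3, 1/3, 1/3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# (I - P)^(-1) = I + P + P^2 + ... sums the probabilities of walks of
# every length between each pair of vertices.
N = np.linalg.inv(np.eye(4) - P)

# Total probability over all walks from vertex 0 back to vertex 0.
# N[0, 0] counts the empty (length-0) walk with probability 1, so
# subtract 1 to keep only walks that traverse at least one edge.
print(N[0, 0] - 1)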

Related

DFS Greedy Chromatic Number

In school I learned that calculating the chromatic number of an arbitrary graph is NP-complete.
I understand why the plain greedy algorithm does not work, but what about a DFS/greedy algorithm?
The main idea is to do a DFS and, for each vertex not yet colored, assign the minimum color index not used by any of its neighbours.
I can't figure out a counterexample and this question is blowing my mind.
Thanks for all of your answers.
Pseudocode
Chromatic(Vertex x){
    for each neighbour y of vertex x
        if color(y) = -1
            color(y) <- minimum color index not used by any neighbour of y
            if (color(y) >= numColors) numColors++;
            Chromatic(y);
}

Main(){
    Set the color of every vertex to -1
    Take an arbitrary vertex u and set color(u) = 0
    numColors = 1;
    Chromatic(u);
    print numColors;
}
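Here is a runnable Python translation of that pseudocode (my own sketch, assuming a connected graph given as an adjacency dict), for anyone who wants to hunt for counterexamples:

def dfs_greedy_chromatic(graph):
    # graph: dict mapping each vertex to a list of its neighbours
    color = {v: -1 for v in graph}

    def chromatic(x):
        for y in graph[x]:
            if color[y] == -1:
                used = {color[z] for z in graph[y] if color[z] != -1}
                c = 0
                while c in used:  # smallest color unused by y's neighbours
                    c += 1
                color[y] = c
                chromatic(y)

    start = next(iter(graph))
    color[start] = 0
    chromatic(start)
    return max(color.values()) + 1

# A triangle needs exactly 3 colors.
print(dfs_greedy_chromatic({0: [1, 2], 1: [0, 2], 2: [0, 1]}))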
Here's a concrete counterexample: the Petersen graph. Your algorithm computes 4, regardless of where you start (I think), but the graph's chromatic number is 3.
The Petersen graph is a classical counterexample for many greedy attempts at graph problems, and also for conjectures in graph theory.
The answer is that sometimes you will have a vertex which has 2 colors available, and making the wrong choice will cause a problem an undetermined time later.
Suppose you have vertices 1 through 9. Draw them around a circle. Then add edges to make the following true.
1, 2, 3 form a triangle.
3 connects to 4.
4, 5, 6 make a triangle.
5, 6, 7 make a triangle.
6, 7, 8 make a triangle.
7, 8, 9 make a triangle.
8, 9, 1 make a triangle.
9, 1, 2 make a triangle.
It is easy to color this with 3 colors. But a depth-first greedy algorithm has a choice of 2 colors it can give to vertex 4. Make the wrong choice, and you'll wind up needing 4 colors, not 3.
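To make the construction concrete, here is a short sketch that builds the 9-vertex graph described above and brute-forces a proper 3-coloring to confirm one exists (the triangles already force at least 3 colors, so the chromatic number is exactly 3):

from itertools import product

# Edges of the 9-vertex construction, with duplicate triangle sides removed.
edges = [(1, 2), (2, 3), (1, 3), (3, 4),
         (4, 5), (4, 6), (5, 6),
         (5, 7), (6, 7),
         (6, 8), (7, 8),
         (7, 9), (8, 9),
         (8, 1), (9, 1), (9, 2)]

def proper(coloring):  # coloring[v-1] is the color of vertex v
    return all(coloring[u - 1] != coloring[v - 1] for u, v in edges)

# Try all 3^9 assignments of 3 colors; prints True, so 3 colors suffice.
print(any(proper(c) for c in product(range(3), repeat=9)))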

Closest Pair Brute Force Algorithm - Basic Operation

Given the following pseudo code for calculating the distance between the two closest points in the plane by brute force:
for (i <- 1 to n-1) do
    for (j <- i+1 to n) do
        d <- min(d, sqrt((xi - xj)^2 + (yi - yj)^2))
return d
The basic operation is computing the square root. But apparently computing square roots in the loop can be avoided by dropping the square root function and comparing the values (xi - xj)^2 + (yi - yj)^2 themselves. I looked it up and got this: "the smaller the number of which we take the root, the smaller its square root, meaning the square root function is strictly increasing". Therefore the basic operation becomes squaring the numbers. Is anyone able to explain this?
The easiest way to answer your question is by seeing why you can avoid taking the square root. Consider the following set of distances between points, in ascending order:
{2, 5, 10, 15, 30, ...} = {d(1,1), d(1,2), d(1,3), d(2,1), d(2,2), ...}
Each one of these distances was computed as the square root of the sum of the squared x and y differences. But we can square each item in this set and arrive at the same order:
{4, 25, 100, 225, 900, ...} = {d(1,1)^2, d(1,2)^2, d(1,3)^2, d(2,1)^2, d(2,2)^2, ...}
Notice carefully that the positions of the distances, including the minimum distance (the first entry), did not change position.
It should be noted that your pseudo-code does not actually compute the minimum distance, but it could easily be modified to compute this.
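In code, the idea looks like this; a minimal Python sketch (the helper name is mine) that compares squared distances inside the loop and takes a single square root at the very end:

import math

def closest_pair_distance(points):
    # points: list of (x, y) tuples, at least two of them
    best = float('inf')
    n = len(points)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            # comparing squared distances preserves the ordering
            best = min(best, dx * dx + dy * dy)
    return math.sqrt(best)  # one sqrt, applied only to the minimum

print(closest_pair_distance([(0, 0), (3, 4), (1, 1)]))  # 1.4142...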

Eigenvector Centrality Algorithm/Pseudocode

I was wondering if anybody could point me in the direction of some eigenvector centrality pseudocode, or an algorithm (for social network analysis). I've already bounced around Wikipedia and Googled for some information, but I can't find any description of a generalized algorithm or pseudocode.
Thanks!
The eigenvector centrality of a vertex v in a graph G just seems to be the v'th entry of the dominant eigenvector of G's adjacency matrix A, scaled so that the entries of that eigenvector sum to 1.
The power iteration, starting from any strictly-positive vector, will tend to the dominant eigenvector of A.
Notice that the only operation that power iteration needs to do is multiply A by a vector repeatedly. This is easy to do; the i'th entry of Av is just the sum of the entries of v corresponding to vertices j to which vertex i is connected.
The rate of convergence of power iteration is linear, governed by the ratio of the largest eigenvalue to the eigenvalue whose absolute value is second largest. That is, if the largest eigenvalue is lambda_max and the second-largest-by-absolute-value eigenvalue is lambda_2, the error in your eigenvalue estimate gets reduced by a factor of lambda_max / |lambda_2| at each iteration.
Graphs that arise in practice (social network graphs, for instance) typically have a wide gap between lambda_max and lambda_2, so power iteration will typically converge acceptably fast; within a few dozen iterations, and almost irrespective of the starting point, you will have an eigenvalue estimate that's within 10^-9.
So, with that theory in mind, here's some pseudocode:
Let v = [1, 1, 1, 1, ... 1].
Repeat 100 times {
    Let w = [0, 0, 0, 0, ... 0].
    For each person i in the social network
        For each friend j of i
            Set w[j] = w[j] + v[i].
    Set v = w.
}
Let S be the sum of the entries of v.
Divide each entry of v by S.
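Here is a minimal numpy sketch of that pseudocode. The per-iteration normalization is an addition on my part, to keep the entries from overflowing on large graphs; it does not change the direction of the vector:

import numpy as np

def eigenvector_centrality(A, iterations=100):
    # A: square symmetric adjacency matrix as a numpy array
    v = np.ones(A.shape[0])
    for _ in range(iterations):
        v = A @ v        # w[j] = sum of v[i] over the neighbours i of j
        v = v / v.sum()  # renormalize so the entries sum to 1
    return v

# A triangle with a pendant vertex attached to vertex 0:
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(eigenvector_centrality(A))  # vertex 0 gets the highest centrality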
I only know a little about it. This is the pseudocode I learned in class.
input: a diagonalizable matrix A
output: a scalar number h, which is the greatest (in absolute value) eigenvalue of A, and a nonzero vector v, the corresponding eigenvector of h, such that Av = hv
begin
    initialization: initialize a vector b_0, which may be an approximation to the dominant eigenvector or a random vector, and let k = 0
    while k is smaller than the maximum iteration
        calculate b_(k+1) = A*b_k / |A*b_k|
        set k = k+1
end

Finding the number of paths of given length in an undirected unweighted graph

'Length' of a path is the number of edges in the path.
Given a source and a destination vertex, I want to find the number of paths from the source vertex to the destination vertex of a given length k.
We can visit each vertex as many times as we want, so if a path from a to b goes like this: a -> c -> b -> c -> b, it is considered valid. This means there can be cycles and we can go through the destination more than once.
Two vertices can be connected by more than one edge. So if vertex a and vertex b are connected by two edges, then the paths a -> b via edge 1 and a -> b via edge 2 are considered different.
Number of vertices N is <= 70, and K, the length of the path, is <= 10^9.
As the answer can be very large, it is to be reported modulo some number.
Here is what I have thought so far:
We can use breadth-first search without marking any vertices as visited; at each iteration, we keep track of the number of edges n_e required for that path and the product p of the numbers of duplicate edges along the path.
The search should terminate if n_e is greater than k; if we ever reach the destination with n_e equal to k, we terminate the search and add p to our count of the number of paths.
I think we could use a depth-first search instead of a breadth-first search, as we do not need the shortest path, and the size of the queue used in breadth-first search might not be enough.
The second algorithm I am thinking about is something similar to Floyd-Warshall's algorithm using this approach. Only we don't need a shortest path, so I am not sure this is correct.
The problem I have with my first algorithm is that K can be up to 1000000000, which means my search will run until paths have 10^9 edges; the edge count n_e is incremented by just 1 at each level, which will be very slow, and I am not sure it will ever terminate for large inputs.
So I need a different approach to solve this problem; any help would be greatly appreciated.
So, here's a nifty graph theory trick that I remember for this one.
Make an adjacency matrix A, where A[i][j] is 1 if there is an edge between i and j, and 0 otherwise.
Then, the number of paths of length k between i and j is just the [i][j] entry of A^k.
So, to solve the problem, build A and construct A^k using matrix multiplication (the usual trick for doing exponentiation applies here). Then just look up the necessary entry.
EDIT: Well, you need to do the modular arithmetic inside the matrix multiplication to avoid overflow issues, but that's a much smaller detail.
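A minimal Python sketch of that approach (the helper names are mine), with the modulus applied inside every multiplication:

def mat_mult(X, Y, m):
    # multiply two n x n matrices, reducing every entry mod m
    n = len(X)
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if X[i][k]:
                for j in range(n):
                    Z[i][j] = (Z[i][j] + X[i][k] * Y[k][j]) % m
    return Z

def count_walks(A, k, s, t, m):
    # [s][t] entry of A^k mod m, by exponentiation by squaring:
    # O(n^3 log k) time, fine for n <= 70 and k <= 10^9
    n = len(A)
    R = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    B = [row[:] for row in A]
    while k:
        if k & 1:
            R = mat_mult(R, B, m)
        B = mat_mult(B, B, m)
        k >>= 1
    return R[s][t]

# 2-vertex path graph: one walk of length 4 from 0 back to 0.
print(count_walks([[0, 1], [1, 0]], 4, 0, 0, 10**9 + 7))  # 1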
Actually, the [i][j] entry of A^k gives the number of distinct "walks", not "paths", in a simple graph. We can easily prove this by mathematical induction.
However, the major question is to find the total number of distinct "paths" in a given graph.
There are quite a few different algorithms for this, but an upper bound is as follows:
(n-2)*(n-3)*...*(n-k), where k is the given parameter stating the length of the path.
Let me add some more content to the above answers (as this is the extended problem I faced). The extended problem is:
Find the number of paths of length k in a given undirected tree.
The solution is simple: for the given adjacency matrix A of the graph G, find A^(k-1) and A^k, and then count the number of 1s in the elements above the diagonal (or below) of A^k - A^(k-1).
Let me also add the Python code.
import numpy as np

def count_paths(v, n, a):
    # v: number of vertices, n: expected path length, a: adjacency matrix
    paths = 0
    b = np.array(a, copy=True)
    for i in range(n - 2):
        b = np.dot(b, a)   # after the loop, b = a^(n-1)
    c = np.dot(b, a)       # c = a^n
    x = c - b
    for i in range(v):
        for j in range(i + 1, v):
            if x[i][j] == 1:
                paths = paths + 1
    return paths

print(count_paths(5, 2, np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
])))

Optimization problem - finding a maximum

I have a problem at hand which can be reduced to something like this :
Assume a bunch of random points in a two-dimensional X-Y plane, where for each Y there could be multiple points on X, and for each X there could be multiple points on Y.
Whenever a point (Xi, Yi) is chosen, no other point with X = Xi OR Y = Yi can be chosen. We have to choose the maximum number of points.
This can be reduced to a simple maximum-flow problem. If you have a point (xi, yi), in the graph it should be represented by a path from the source S to node xi, from xi to yi, and from yi to the last node (sink) T.
Note that if we have points (2, 2) and (2, 5), there's still only one edge from S to x2. All edges have capacity 1.
The flow in this network is the answer.
About the general problem:
http://en.wikipedia.org/wiki/Max_flow
update
I don't have a graphics editor right now to visualise the problem, but you can easily draw the example by hand. Let's say the points are (3, 3), (3, 5), (2, 5).
Then edges (paths) would be
S -> x2, S -> x3
y3 -> T, y5 -> T
x3 -> y3, x3 -> y5, x2 -> y5
Flow: S -> x2 -> y5 -> T and S -> x3 -> y3 -> T
The amount of 'water' going from source to sink is 2 and so is the answer.
Also there's a tutorial about max flow algorithms
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=maxFlow
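As a sanity check, here is a small sketch of the construction using networkx (my choice of library; any max-flow implementation works), run on the example above:

import networkx as nx

def max_points(points):
    # Build the network: S -> x -> y -> T for each point, capacities all 1.
    G = nx.DiGraph()
    for x, y in points:
        G.add_edge('S', ('x', x), capacity=1)
        G.add_edge(('x', x), ('y', y), capacity=1)
        G.add_edge(('y', y), 'T', capacity=1)
    value, _ = nx.maximum_flow(G, 'S', 'T')
    return value

print(max_points([(3, 3), (3, 5), (2, 5)]))  # 2, matching the example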
Isn't this just the Hungarian algorithm?
Create an n×n matrix, with 0 at marked vertices, and 1 at unmarked vertices. The algorithm will choose n vertices, one for each row and column, which minimizes their sum. Simply count all the chosen vertices which equal 0, and you have your answer.
from munkres import Munkres

matrix = [[0, 0, 1],
          [0, 1, 1],
          [1, 0, 0]]
m = Munkres()
total = 0
for row, column in m.compute(matrix):
    if matrix[row][column] == 0:
        print('(%i, %i)' % (row, column))
        total += 1
print('Total: %i' % total)
This runs in O(n^3) time, where n is the number of rows in the matrix. The maximum flow solution runs in O(V^3), where V is the number of vertices. As long as there are more than n chosen intersections, this runs faster; in fact, it runs orders of magnitude faster, as the number of chosen vertices goes up.
Different solution. It turns out that there's a lot of symmetry, and the answer is a lot simpler than I originally thought. The maximum number of things you can ever do is the minimum of the unique X's and unique Y's, which is O(N log N) if you just want the result.
Every other shape is equivalent to a rectangle that contains the points, because it doesn't matter how many points you pull from the center of a rectangle, the order will never matter (if handled as below). Any shape that you pluck a point from now has one less unique X and one less unique Y, just like a rectangle.
So the optimal solution has nothing to do with connectedness. Pick any point that is on the edge of the smallest dimension (i.e. if len(unique-Xs) > len(unique-Ys), pick anything that has either maximum or minimum X). It doesn't matter how many connections it has, just which dimension is biggest, which can easily be done while looking at the sorted-unique lists created above. If you keep a unique-x and unique-y counter and decrement them when you delete all the unique nodes in that element of the list, then each deletion is O(1) and recalculating the lengths is O(1). So repeating this N times is at worst O(N), and the final complexity is O(N log N) (due solely to the sorting).
You can pick any point along the shortest edge because:
if there's only one on that edge, you better pick it now or something else will eliminate it
if there's more than one on that edge, who cares, you will eliminate all of them with your pick anyways
Basically, you're maximizing "max(uniqX,uniqY)" at each point.
Update: IVlad caught an edge case:
If the dimensions are equal, take the edge with the fewest points. Even if they aren't equal, take the top or bottom of the unique-stack you're eliminating from that has the fewest points.
Case in point:
Turn 1:
Points: (1, 2); (3, 5); (10, 5); (10, 2); (10, 3)
There are 3 unique X's: 1, 3, 10
There are 3 unique Y's: 2, 3, 5
The "bounding box" is (1,5),(10,5),(10,2),(1,2)
Reaction 1:
The "outer edge" (outermost uniqueX or uniqueY lists of points) that has the least points is the left. Basically, look at the sets of points in x=1,x=10 and y=2,y=5. The set for x=1 is the smallest: one point. Pick the only point for x=1 -> (1,2).
That also eliminates (10,2).
Turn 2:
Points: (3, 5); (10, 5); (10, 3)
There are 2 unique X's: 3, 10
There are 2 unique Y's: 3, 5
The "bounding box" is (3,5),(10,5),(10,3),(3,3)
Reaction 2:
The "edge" of the bounding box that has the least is either the bottom or the left. We reached the trivial case - 4 points means all edges are the outer edges. Eliminate one. Say (10,3).
That also eliminates (10,5).
Turn 3:
Points: (3, 5)
Reaction 3:
Remove (3,5).
For each point, identify the number N of other points that would be disqualified by the selection of that point (i.e. the ones with the same X or Y values). Then, iterate over the not-yet-disqualified points in order of increasing N. When you are finished, you will have selected the maximum number of points.
The XY plane is a red herring. Phrase it as a set of elements, each of which has a set of mutually exclusive elements.
The algorithm then becomes a depth-first search. At each level, for each candidate node, calculate the set of excluded elements, the union of currently excluded elements with the elements excluded by the candidate node. Try candidate nodes in order of fewest excluded elements to most. Keep track of the best solution so far (the fewest excluded nodes). Prune any subtrees that are worse than the current best.
As a slight improvement at the cost of possible missed solutions, you can use Bloom filters for keeping track of the excluded sets.
This looks like a problem that can be solved with dynamic programming. Look into the algorithms for longest common substring, or the knapsack problem.
Based on a recommendation from IVlad, I looked into the Hopcroft–Karp algorithm. It's generally better than both the maximum flow algorithm and the Hungarian algorithm for this problem, often significantly. Some comparisons:
In general:
Max Flow: O(V^3), where V is the number of vertices.
Hungarian: O(n^3), where n is the number of rows in the matrix.
Hopcroft-Karp: O(V √(2V)), where V is the number of vertices.
For a 50×50 matrix, with 50% chosen vertices:
Max Flow: 1,250^3 = 1,953,125,000
Hungarian: 50^3 = 125,000
Hopcroft-Karp: 1,250 √2,500 = 62,500
For a 1000×1000 matrix, with 10 chosen vertices:
Max Flow: 10^3 = 1,000
Hungarian: 1000^3 = 1,000,000,000
Hopcroft-Karp: 10 √20 ≅ 44.7
The only time the Hungarian algorithm is better is when there is a significantly high proportion of points selected.
For a 100×100 matrix, with 90% chosen vertices:
Max Flow: 9,000^3 = 729,000,000,000
Hungarian: 100^3 = 1,000,000
Hopcroft-Karp: 9,000 √18,000 ≅ 1,207,476.7
The Max Flow algorithm is never better.
The Max Flow algorithm is never better.
It's also quite simple, in practice. This code uses an implementation by David Eppstein:
points = {
    0: [0, 1],
    1: [0],
    2: [1, 2],
}
selected = bipartiteMatch(points)[0]
for x, y in selected.items():
    print('(%i, %i)' % (x, y))
print('Total: %i' % len(selected))
