Presumably it is possible, because PageRank is a form of eigenvalue problem, and that is why MapReduce was introduced. But there seem to be problems in an actual implementation, such as every slave computer having to maintain a copy of the matrix?
PageRank solves the dominant eigenvector problem by iteratively finding the steady-state discrete flow condition of the network.
If the N×N matrix A describes the link weight (amount of flow) from node i to node j, then
p_{k+1} = A . p_{k}
In the limit where p has converged to a steady state (p_{k+1} = p_k), this is an eigenvector problem with eigenvalue 1.
The PageRank algorithm doesn't require the matrix to be held in memory, but it is inefficient on dense (non-sparse) matrices. For dense matrices, MapReduce is the wrong solution -- you need locality and broad exchange among nodes -- so you should instead look at LAPACK, MPI, and friends.
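For intuition, here is a minimal single-machine sketch (not Google's actual implementation) of the power iteration PageRank performs, assuming a column-stochastic scipy.sparse transition matrix and a damping factor of 0.85, both illustrative choices:
import numpy as np
import scipy.sparse as sp

def pagerank_power(A, damping=0.85, tol=1e-10, max_iter=100):
    """Power iteration on a column-stochastic sparse transition matrix A."""
    n = A.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_next = damping * A.dot(p) + (1.0 - damping) / n
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Tiny 3-node example: column j holds the out-link probabilities of node j.
A = sp.csc_matrix(np.array([[0.0, 0.5, 1.0],
                            [0.5, 0.0, 0.0],
                            [0.5, 0.5, 0.0]]))
print(pagerank_power(A))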
You can see a working PageRank implementation in the wukong library (Hadoop streaming for Ruby) or in the Heritrix PageRank submodule. (The Heritrix code runs independently of Heritrix.)
(disclaimer: I am an author of wukong.)
PREAMBLE:
Given the right partitioning of data, one can achieve parallel computation without a complete copy of the dataset on every machine.
Take for example the following loop:
for (int i = 0; i < m.length; i++)
{
    for (int j = 0; j < m[i].length; j++)
    {
        m[i][j]++;
    }
}
And given a matrix of the following layout:
j=0 j=1 j=2
i=0 [ ] [ ] [ ]
i=1 [ ] [ ] [ ]
i=2 [ ] [ ] [ ]
Parallel constructs exist such that each column (fixed j) can be sent to a separate computer and the columns are computed in parallel (see the sketch below). The difficult part of parallelization comes when you have loops that contain dependencies.
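Before turning to the dependent case, here is a rough Python sketch of the independent loop above (the original loop is Java-like; numpy and ProcessPoolExecutor are my own illustrative choices), where each column can be handed to a separate worker because no element depends on any other:
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def bump_column(col):
    # Each worker increments one column independently -- no cross-column dependency.
    return col + 1

if __name__ == "__main__":
    m = np.arange(9.0).reshape(3, 3)
    columns = [m[:, j].copy() for j in range(m.shape[1])]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(bump_column, columns))
    m = np.column_stack(results)   # reassemble the updated matrix
    print(m)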
for (int i = 0; i < m.length; i++)
{
    for (int j = 0; j < m[i].length; j++)
    {
        //For obvious reasons, matrix index verification code removed
        m[i][j] = m[i/2][j] + m[i][j+7];
    }
}
Obviously a loop like the one above becomes extremely problematic (notice the matrix indexers.) But techniques do exist for unrolling these types of loops and creating effective parallel algorithms.
ANSWER:
It is possible that Google developed a solution to compute an eigenvalue without maintaining a copy of the matrix on all slave computers. -Or- They used something like a Monte Carlo method or some other approximation algorithm to develop a "close enough" calculation.
In fact, I'd go so far as to say that Google will have gone to as great lengths as possible to make any calculation required for their PageRank algorithm as efficient as possible. When you're running machines such as these and this (notice the Ethernet cable), you can't be transferring large datasets (hundreds of gigabytes), because that is impossible given the hardware limitations of commodity NIC cards.
With that said, Google is good at surprising the programmer community and their implementation could be entirely different.
POSTAMBLE:
Some good resources for parallel computing include OpenMP and MPI. The two approach parallel computing from very different paradigms, partly stemming from the underlying machine model (shared-memory vs. distributed computing).
I suspect it is intractable for most matrices except those w/ special structures (e.g. sparse matrices or ones w/ certain block patterns). There's way too much coupling between matrix coefficients and eigenvalues.
PageRank uses a very sparse matrix of a special form, and any conclusions from calculating its eigenvalues almost certainly don't extend to general matrices. (edit: here's another reference that looks interesting)
I can answer my own question now. The PageRank algorithm takes advantage of a sparse matrix, where repeated self-multiplication (power iteration) converges to the dominant eigenvector. Thus, in PageRank practice, the Map/Reduce procedure is valid: you can perform the matrix multiply in the Map step and re-form the sparse matrix in the Reduce step. But for general matrices, finding eigenvalues is still a tricky problem.
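To make the Map/Reduce shape of one PageRank iteration concrete, here is a small in-memory simulation in Python; the dict-based "emit" and grouping merely stand in for Hadoop's shuffle, and the toy graph and damping factor are illustrative:
from collections import defaultdict

# Adjacency list: node -> out-links (a toy graph, for illustration only).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {n: 1.0 / len(links) for n in links}
damping = 0.85

for _ in range(20):
    # Map: each node emits a share of its rank to every out-neighbour.
    emitted = defaultdict(list)
    for node, outs in links.items():
        for dest in outs:
            emitted[dest].append(rank[node] / len(outs))
    # Reduce: each node sums the contributions it received.
    rank = {n: (1 - damping) / len(links) + damping * sum(emitted[n])
            for n in links}

print(rank)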
The Apache Hama project has an interesting implementation of the Jacobi eigenvalue algorithm. It runs on Hadoop. Note that the rotation happens in the scan of the matrix, not in the map/reduce steps.
Related
Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i, j, a), with indices i, j (between 0 and n-1) and a a value in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
Notes:
I'm interested both in exact solutions over finite fields, and in exact and bounded-error solutions over the reals or complex numbers using floating-point representation. I suppose exact or bounded-error solutions over the rationals are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The two main algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured Gaussian elimination.
The LaMacchia-Odlyzko paper (the one on the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (a linked list) so that the matrix-vector multiply takes time proportional to the number of non-zero values in the matrix (i.e. its sparsity).
Parallelisation of these algorithms is trivial, but optimisation depends on the architecture of your system. The matrix-vector multiply is parallelised by splitting the matrix into blocks of rows (each processor gets one block); each block of rows is multiplied by the vector separately, and then you combine the partial results to get the new vector.
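A minimal sketch of that row-block split, using scipy.sparse (CSR here, rather than the linked-list layout mentioned above) and a process pool standing in for the separate processors; the matrix size, density and block count are arbitrary:
import numpy as np
import scipy.sparse as sp
from concurrent.futures import ProcessPoolExecutor

def block_times_vector(args):
    block, v = args
    # Each worker multiplies its block of rows by the full vector.
    return block.dot(v)

if __name__ == "__main__":
    n, n_blocks = 10_000, 4
    A = sp.random(n, n, density=1e-3, format="csr")
    v = np.random.rand(n)
    bounds = np.linspace(0, n, n_blocks + 1, dtype=int)
    blocks = [A[bounds[k]:bounds[k + 1]] for k in range(n_blocks)]
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(block_times_vector, [(b, v) for b in blocks]))
    result = np.concatenate(parts)          # recombine the partial results into the new vector
    assert np.allclose(result, A.dot(v))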
I've done these types of computations extensively. The team that originally broke the RSA-129 factorisation took 6 weeks using structured Gaussian elimination on a 16,384-processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!
The mathematical problem
Let there be 2n persons, and C(i,j) the "cost" of having i and j work together (the function C is quick to compute, in my case it is a given matrix, and is symmetric). The question is to find the arrangement of 2n pairs of persons that minimizes the sum of the costs of each pair.
This should be done in polynomial complexity in n, and implemented relatively easily in the Scilab language (input: cost matrix, output: pairings, for instance an n-by-2 matrix of indices). I am aware that "relatively easily" is subject to interpretation...
Previous research
This problem is actually solved by the Blossom algorithm. See for instance this paper.
However, this (and its variants) looks like a nightmare to implement. My real problem is for n=20, so although brute force (= trying all possible pairings) is not OK (brute-forcing n=8 took an hour on my computer), pretty much anything better than brute force should do the trick; if I can avoid one week of coding at the cost of one hour of computation I'm in.
I was thinking along the lines of using the Hungarian/Munkres algorithm on a 2n-by-2n array filling the diagonal with +%inf and other elements by the symmetric cost matrix, then somehow selecting from the resulting permutation a relevant pairing, but I fail to find a reliable way to do this. (Note, the Hungarian algorithm is already coded for a separate section, so you may use it without cost to the "easy to implement" requirement.)
I hope that compared to the blossom-algorithm problem, the completeness of the graph allows for some shortcuts... (Edit: see DE's comment below, this is wrong for semi-obvious reasons)
I do not know Scilab I am afraid, but if you are willing to use Python it is very easy as the Networkx library provides support for this function:
import networkx as nx
import networkx.algorithms.matching as matching

def C(i, j):
    return i * j

n = 40
G = nx.Graph()
for i in range(n):
    for j in range(n):
        G.add_edge(i, j, weight=-C(i, j))

M = matching.max_weight_matching(G, maxcardinality=True)
for i in M:
    print i, 'with', M[i]
This code prints out the answer within a second.
The function C defines the cost of pairing i with j. Note that the weights are set to -C(i,j) in order to transform the max_weight_matching into a min_weight_matching algorithm.
I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate in running competitions of 10 km and 20 km on hilly courses, i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
       Comp1   Comp2   Comp3
User1  20min   ??      10min
User2  25min   20min   12min
User3  30min   25min   ??
User4  30min   ??      ??
I would like to complete the matrix above which has the size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (the mu and lambda of the distributions). Moreover, the correlations between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets you most of the way to the solution. I'll codify your intuition that the combination of ability of a user and difficulty of the course yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users, so that u_i is user i's speed. Let the vector v denote the difficulty of the courses, so that v_j is course j's difficulty. Also, when available, let t_ij be user i's time on course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the prediction error over the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It minimizes the above cost function by gradient descent. The gradients of f with respect to u and v (dropping a constant factor of 2) are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a conjugate gradient solver or BFGS optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
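A hedged sketch of Algorithm 1 in Python, fitting u and v with scipy's fmin_bfgs on a small synthetic example (the data, dimensions and observation mask are made up for illustration; the factor of 2 dropped above is restored in the gradient so it matches the cost exactly):
import numpy as np
from scipy.optimize import fmin_bfgs

n_users, n_courses = 5, 4
rng = np.random.default_rng(0)
u_true, v_true = rng.random(n_users) + 0.5, rng.random(n_courses) + 0.5
Y = np.outer(u_true, v_true)                         # noiseless toy speeds y_ij
observed = rng.random((n_users, n_courses)) < 0.7    # mask of known entries

def unpack(x):
    return x[:n_users], x[n_users:]

def cost(x):
    u, v = unpack(x)
    resid = (np.outer(u, v) - Y)[observed]
    return np.sum(resid ** 2)

def grad(x):
    u, v = unpack(x)
    E = np.where(observed, np.outer(u, v) - Y, 0.0)  # residuals on known cells only
    return 2.0 * np.concatenate([E.dot(v), E.T.dot(u)])   # df/du_i, df/dv_j

x_opt = fmin_bfgs(cost, np.ones(n_users + n_courses), fprime=grad, disp=False)
u_hat, v_hat = unpack(x_opt)
prediction = np.outer(u_hat, v_hat)                  # completed speed matrix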
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been proposed. The resulting algorithms are just as simple to code up and seem to work very well. Check out, for example, Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will again end up computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this, perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step, normalize your data to zero mean and unit standard deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the standard deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is, your loss function is:
total_error = 0
forall i,j
    if (Participant i participated in Race j)
        actual = ActualRaceResult(i,j)
        predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
        total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term in the loss function, for example the squared lengths of the feature vectors.)
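A rough numpy sketch of that scheme (the feature dimension N, learning rate, regularization weight and toy data are arbitrary illustrative values):
import numpy as np

rng = np.random.default_rng(1)
P, R, N = 50, 10, 4                    # participants, races, feature dimension
results = rng.random((P, R))           # normalized race results (toy data)
known = rng.random((P, R)) < 0.1       # only a small fraction of results observed
U = rng.normal(0, 0.1, (P, N))         # participant feature vectors
V = rng.normal(0, 0.1, (R, N))         # race feature vectors
lr, reg = 0.05, 0.01

for epoch in range(200):
    # improve participant vectors on the known results, holding race vectors fixed
    E = np.where(known, U.dot(V.T) - results, 0.0)
    U -= lr * (E.dot(V) + reg * U)
    # then improve race vectors, holding participant vectors fixed
    E = np.where(known, U.dot(V.T) - results, 0.0)
    V -= lr * (E.T.dot(U) + reg * V)

print(U.dot(V.T))                      # completed prediction matrix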
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical missing-data recovery task. Several different methods exist; one I can suggest is based on the Self-Organizing Feature Map (Kohonen's map).
Below it's assumed that every athlete's record is a pattern, and every competition is a feature.
Basically, you should divide your data into two sets: the first with fully defined patterns, and the second with patterns that have partially missing features. I assume this is feasible because the sparsity is 8%, i.e. you have enough data (92%) to train the net on undamaged records.
Then you feed the first set to the SOM and train it on this data. During this process all features are used. I won't copy the algorithm here, because it can be found in many public sources, and some implementations are even available.
After the net is trained, you can feed patterns from the second set to it. For each pattern, the net should compute the best matching unit (BMU) based only on the features that exist in the current pattern. Then you can take from the BMU the weights corresponding to the missing features.
Alternatively, you could avoid dividing the data into two sets and instead train the net on all patterns, including the ones with missing features. But for such patterns the learning process should be altered in a similar way, i.e. the BMU should be computed only over the features present in each pattern.
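For concreteness, here is a minimal self-contained SOM sketch in numpy following this recipe: train on complete rows, then impute missing entries (NaN) from the BMU found over the known features only. The grid size, learning rate and decay schedule are arbitrary choices, not part of the method description above.
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(6, 6), iters=2000, lr0=0.5, sigma0=2.0):
    """Train a tiny SOM on complete patterns (rows of `data`)."""
    n_units, dim = grid[0] * grid[1], data.shape[1]
    w = rng.random((n_units, dim))
    # grid coordinates of each unit, used by the neighbourhood function
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((w - x) ** 2).sum(axis=1))
        lr = lr0 * (1 - t / iters)
        sigma = sigma0 * (1 - t / iters) + 0.5
        h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        w += lr * h[:, None] * (x - w)
    return w

def impute(w, pattern):
    """Fill missing entries (NaN) of `pattern` from the BMU's weights."""
    known = ~np.isnan(pattern)
    # BMU distance computed only over the features that exist in this pattern
    bmu = np.argmin(((w[:, known] - pattern[known]) ** 2).sum(axis=1))
    filled = pattern.copy()
    filled[~known] = w[bmu, ~known]
    return filled
Feeding the complete rows to train_som and each incomplete row to impute mirrors the two-set procedure described above.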
I think you can have a look at recent low-rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the incomplete matrix you currently have.
This formulation minimizes the rank of your matrix M. The operator P in the constraint selects the known entries of M' and constrains those entries of M to equal the corresponding entries of M'.
This optimization problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
Here rank(M) is relaxed to its convex envelope ||M||_*. You then trade off the two terms by controlling the parameter lambda.
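As a sketch of one simple way to attack the relaxed problem, here is an iterative singular-value soft-thresholding ("soft-impute" style) loop in numpy; the threshold tau and iteration count are illustrative, and this is not any particular published solver:
import numpy as np

def soft_impute(M_obs, mask, tau=1.0, n_iter=200):
    """Fill a partially observed matrix by repeatedly soft-thresholding its SVD.

    M_obs : matrix with arbitrary values where mask is False
    mask  : boolean array, True where the entry of M_obs is known
    """
    M = np.where(mask, M_obs, 0.0)          # start from zeros on the unknown entries
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s = np.maximum(s - tau, 0.0)        # shrink singular values: the nuclear-norm proximal step
        M = (U * s).dot(Vt)
        M = np.where(mask, M_obs, M)        # keep the known entries fixed (the P(.) constraint)
    return M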
I have implemented a local clustering coefficient algorithm based on the MapReduce paradigm. However, I have run into serious trouble for bigger datasets or for specific datasets (with a high average node degree). I tried to tune my Hadoop platform and the code, but the results were unsatisfactory (to say the least). Now I have turned my attention to actually changing/improving the algorithm. Below is my current algorithm (pseudocode):
foreach(Node in Graph) {
    //Job1
    /* Transform edge-based input dataset to node-based dataset */

    //Job2
    map() {
        emit(this.Node, this.Node.neighbours) //emit my data to all my neighbours
        emit(this.Node, this.Node)            //emit myself to myself
    }
    reduce() {
        NodeNeighbourhood nodeNeighbourhood;
        while(values.hasNext) {
            if(myself)
                this.nodeNeighbourhood.setCentralNode(values.next) //store my own data
            else
                this.nodeNeighbourhood.addNeighbour(values.next)   //store neighbour data
        }
        emit(null, this.nodeNeighbourhood)
    }

    //Job3
    map() {
        float lcc = calculateLocalCC(this.nodeNeighbourhood)
        emit(0, lcc) //emit all lcc values to a single key; combiners are used
    }
    reduce() {
        float combinedLCC;
        int numberOfNodes;
        while(values.hasNext) {
            combinedLCC += values.next;
            numberOfNodes++; //count the values so the average below is well-defined
        }
        emit(null, combinedLCC/numberOfNodes); //store the graph's average local clustering coefficient
    }
}
A bit more detail about the code: for directed graphs, the neighbour data is restricted to the node ID and the destination IDs of its OUT edges (to decrease the data size); for undirected graphs, it's also the node ID and edge destination IDs. Sort and merge buffers are increased to 1.5 GB, merge streams to 80.
It can clearly be seen that Job2 is the actual problem of the whole algorithm. It generates a massive amount of data that has to be sorted/copied/merged. This basically kills the algorithm's performance for certain datasets. Can someone guide me on how to improve the algorithm? (I was thinking about making Job2 iterative: "process" only M nodes out of N in each iteration until every node is "processed", but I have abandoned this idea for now.) In my opinion the Job2 map output should be decreased, to avoid the costly sort/merge processes which kill performance.
I have also implemented the same algorithm (3 Jobs as well, same "communication" pattern, also "Job2" problem) for the Giraph platform. However Giraph is an in-memory platform and the algorithm for the same "problematic" datasets results in an OutOfMemoryException.
I will be grateful for any comment, remark or guideline.
UPDATE
I'm going to change the algorithm "drastically". I've found this article Counting Triangles.
Once the code is implemented, I'm going to post my opinion here along with more detailed code (if this approach is successful).
UPDATE_2
In the end I ended up "modifying" the NodeIterator++ algorithm to my needs (the Yahoo paper is available through a link in the article). Unfortunately, though I can see an improvement in performance, the end result is not as good as I had hoped. The conclusion I have reached is that the cluster available to me is just too small to make the LCC calculations feasible for these specific datasets. So the question remains, or rather evolves: does anyone know of an efficient distributed/sequential algorithm for calculating the LCC or triangles with limited resources?
(By no means am I stating that the NodeIterator++ algorithm is bad; I am simply stating that the resources available to me are not sufficient.)
In the paper "MapReduce in MPI for large scale graph algorithms" the authors give a nice description of a MapReduce implementation of triangle counting. The paper is available here: http://www.sciencedirect.com/science/article/pii/S0167819111000172, but you may need an account to access it. (I'm on a university system that pays for the subscription, so I never know which papers I can only access because they've already paid.) The authors may have a draft posted on their personal website(s).
There is another way you could count triangles -- probably much less efficient unless your graph is fairly dense. First, construct the adjacency matrix of your graph, A. Then compute A^3 (you could do the matrix multiplication in parallel pretty easily). Then sum up the (i,i) entries of A^3 and divide by 6. That gives you the number of triangles, because the (i,j) entry of A^k counts the number of walks of length k from i to j, and since we are only looking at walks of length 3, any walk that starts at i and ends at i after 3 steps is a triangle... counted 6 times over (3 starting vertices times 2 directions). This is mainly less efficient because the size of the matrix will be very large compared to the size of an edge list if your graph is sparse.
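A quick numpy illustration of that trace(A^3)/6 identity on a toy undirected graph (dense matrices, so this only makes sense for small or dense graphs, as noted above):
import numpy as np

# Undirected toy graph: a triangle 0-1-2 plus a pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

A3 = np.linalg.matrix_power(A, 3)
triangles = np.trace(A3) // 6      # each triangle is counted 6 times on the diagonal
print(triangles)                   # -> 1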
What is the best way to test a clustering algorithm? I am using an agglomerative clustering algorithm with a stop criterion. How do I test if the clusters are formed correctly or not?
A good rule of thumb for evaluating how much a graph can be clustered (on a coarse grained level) has to do with the "eigenvalue gap". Given a weighted graph A, calculate the eigenvalues and sort them (this is the eigenvalue spectrum). When plotted, if there is a large jump in the spectrum at some point, there is a natural corresponding block to partition the graph.
Below is an example (in Python/numpy) that shows that, given an almost block diagonal matrix, there is a large gap in the eigenvalue spectrum at the number of blocks (parameterized by c in the code). Note that a matrix permutation (equivalent to relabeling your graph nodes) still gives the same spectral gap:
from numpy import *
import pylab as plt
# Make a block diagonal matrix
N = 30
c = 5
A = zeros((N*c,N*c))
for m in range(c):
    A[m*N:(m+1)*N, m*N:(m+1)*N] = random.random((N,N))
# Add some noise
A += random.random(A.shape) * 0.1
# Make symmetric
A += A.T - diag(A.diagonal())
# Show the original matrix
plt.subplot(131)
plt.imshow(A.copy(), interpolation='nearest')
# Permute the matrix for effect
idx = random.permutation(N*c)
A = A[idx,:][:,idx]
# Compute eigenvalues
L = linalg.eigvalsh(A)
# Show the results
plt.subplot(132)
plt.imshow(A, interpolation='nearest')
plt.subplot(133)
plt.plot(sorted(L,reverse=True))
plt.plot([c-.5,c-.5],[0,max(L)],'r--')
plt.ylim(0,max(L))
plt.xlim(0,20)
plt.show()
It depends on what you want to test against.
When testing your own implementation of a known algorithm, you might want to compare the results with that of a known good implementation.
Hierarchical clustering is hard to test with respect to quality, as it is hierarchical. The common measures such as Rand index etc. are only valid for strict partitionings. You can get a strict partitioning from a hierarchical clustering, but then you need to fix the height to cut at.
Ideally you have some kind of pre-clustered data (supervised learning) and test the results of your clustering algorithm on that. Simply count the number of correct classifications divided by the total number of classifications performed to get an accuracy score.
If you are doing unsupervised learning, then there is really no way to evaluate your algorithm.
It is sometimes useful to construct input data where there is a known, and perhaps obvious, answer by construction. For a clustering algorithm, you might construct data with N clusters such that the maximum distance between any two points in the same cluster is smaller than the minimum distance between any two points in different clusters. Another option would be to generate a number of different data sets, plottable as 2-d scatter diagrams, with clusters obvious to the eye, then compare the result from your algorithm with this structure, perhaps moving the clusters together to see when the algorithm fails to see them.
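As a concrete sketch of that idea (the scipy and scikit-learn choices are mine, not the answer's): generate well-separated blobs, run agglomerative clustering, cut the dendrogram at the known number of clusters, and score against the ground truth with the adjusted Rand index.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated blobs: the "obvious answer by construction".
X, truth = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

Z = linkage(X, method="average")                  # agglomerative clustering
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters

print(adjusted_rand_score(truth, labels))         # ~1.0 when the algorithm recovers the blobs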
You might be able to do better given knowledge of your particular clustering algorithm, but the above might at least have some chance of flushing obvious bugs from cover.