Graph Theory: Calculating Clustering Coefficient - algorithm

I'm doing some research and I've come to a point where I have to calculate the clustering coefficient of a graph.
According to this paper directly related to my research:
The clustering coefficient C(p) is defined as follows. Suppose that a vertex v has k_v neighbours; then at most k_v(k_v - 1)/2 edges can exist between them (this occurs when every neighbour of v is connected to every other neighbour of v). Let C_v denote the fraction of these allowable edges that actually exist. Define C as the average of C_v over all v.
But this Wikipedia article on the subject defines it differently:
C = (number of closed triplets) / (number of connected triples)
It seems to me that the latter is more computationally expensive.
So really my question is: are they equivalent?
It should be noted that the paper is cited by the Wikipedia article.
Thanks for your time.

The two formulas are not the same; they are two different ways in which the global clustering coefficient can be calculated.
One way is by averaging the clustering coefficients (C_i [1]) of all nodes (this is the method you quoted from Watts and Strogatz). However, in [2, p. 204] Newman argues that this method is less preferable than the second one (the one you got from Wikipedia). He justifies this by pointing out how the value of the global clustering coefficient can be dominated by nodes of low degree, due to C_i's denominator [1]. So, in a network with many nodes of low degree, you end up with a large value for the global clustering coefficient, which Newman argues would be unrepresentative.
However, many network studies (or, in my experience, at least many studies concerned with online social networks) seem to have used this method, so in order to be able to compare your results with theirs, you would need to use the same method. Furthermore, the critique raised by Newman does not affect the extent to which comparisons of global clustering coefficients can be made, provided the same method was used in measuring them.
The two formulae are different and were proposed at different moments in time. The one you quoted from Watts and Strogatz is older, which is perhaps why it seems to have been more commonly used. Newman also explains that the two formulae are far from equivalent, and shouldn't be used as such. He says they can give substantially different numbers for a given network, but doesn't explain why.
[1] C_i = (number of pairs of neighbours of i that are connected) / (number of pairs of neighbours of i)
[2] Newman, M.E.J. Networks: An Introduction. Oxford; New York: Oxford University Press, 2010. Print.
Edit:
I am including here a series of calculations for the same ER random graph. You can see how the two methods give different results, even for undirected graphs. (done using Mathematica)
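The original Mathematica output is not reproduced here, but the same comparison is easy to sketch in Python with networkx (assuming networkx is acceptable); the two numbers generally differ:

import networkx as nx

# Compare the two definitions on an Erdos-Renyi random graph.
G = nx.gnp_random_graph(100, 0.05, seed=42)

avg_local = nx.average_clustering(G)  # mean of the local C_i (Watts-Strogatz definition)
global_cc = nx.transitivity(G)        # 3 * (number of triangles) / (number of connected triples)

print(avg_local, global_cc)           # typically two different values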

I think they're equivalent. The wiki page you link to gives a proof that the triples formulation is equivalent to the fraction of possible edges formulation when calculating the local clustering coefficient, i.e. calculated just at a vertex. From there it seems that you just need to show that
sum_v lambda(v)/tau(v) = 3 x # triangles / # connected triples
where lambda(v) is the number of triangles containing v, and tau(v) is the number of connected triples for which v is the middle vertex, i.e. v is incident to both of the triple's edges.
Now each triangle gets counted three times in the numerator of the LHS. However, each connected triple is only counted once for the middle vertex on the LHS, so the denominators are the same.

I partially disagree with Whatang. These methods are only equivalent for undirected graphs; for directed graphs they return different results. In my opinion the local clustering coefficient method is the correct one, not to mention that it's less computationally expensive. For example:
Take the directed graph with edges 4->5, 5->4, 4->6, 6->4, 6->5 and 6->7:
4(IN [5,6], OUT [5,6])
5(IN [4,6], OUT [4])
6(IN [4], OUT [4,5,7])
7(IN [6], OUT [])
central = 6
localCC = 2 / (4*3) = 1/6
globalCC = 1 / 3

I wouldn't trust that Wikipedia article. The first formula you cited is currently defined there as the mean clustering coefficient, i.e. the mean of all local clustering coefficients for a graph g. This is in no way the same as the global clustering coefficient, as xk_id aptly put it.

There is a great page to learn the basics from:
http://www.learner.org/courses/mathilluminated/interactives/network/
It's all about clustering coefficients, small worlds and so on...

Related

How do I find the longest path in a weighted graph?

If I am given a data structure with currency conversion rates:
a list of currency relationships with exchange values. (INR - USD)
Then how can I find the best exchange rate from currency1 to currency2?
My thought process:
Method 1:
If I take the list of exchange values and convert it to a graph (an adjacency list plus a weight list, since this seems to be a weighted graph problem), I can use DFS to find all possible paths and keep track of the path that generates the highest exchange rate: I multiply every conversion rate along the path and store the product; whenever a path generates a better conversion rate I update this variable, so at the end I have the maximum.
Please comment on the correctness of this algorithm. Am I thinking correctly? Would this generate the correct result?
A problem I see right away is that this is very inefficient since it would take exponential time.
Method 2: Can I just negate all the conversions and use Bellman-Ford, since Bellman-Ford is used to find least-cost paths in a weighted graph?
Thanks. Any guidance would be truly appreciated
Your intuition is correct: you could use DFS, and it would give you the best exchange rate (the shortest path by weight, after the transformation described below), but it would be extremely slow for large graphs.
Your second method (Bellman-Ford) is a much better idea. As you mention, you'll have to multiply the exchange rates / edge weights rather than add them, but this shouldn't pose any issues.
I assume you already worked this out, but for anyone referencing this in the future: you cannot use Dijkstra's algorithm or its descendants like A*, because the graph, in spirit, has negative cycles. You could find a conversion rate less than 1 and potentially exploit this to get an overall lower minimum conversion rate (and by inverting the two currencies, a maximum conversion rate in the opposite direction).
A mathematical digression:
A way to see this more clearly- imagine we have a few conversion rates, between 3 pairs of currencies- A, B, and C. Assuming the units check out, the overall conversion rate R across these three exchanges would be R = A * B * C. Another way we could write this would be R = e ^ log(A * B * C), where e is Euler's number, and log() is the natural logarithm (we could just as well have used 10 and log10(), or any other base). Rewriting this using the rules of logarithms, we can get R = e ^ (log(A) + log(B) + log(C)), and finally log(R) = log(A) + log(B) + log(C).
Now, if we don't care about the actual value of R, just which is largest / smallest (or we're willing to perform some exponentiation to get it), we can settle for computing log(R), the log of the exchange rate. The benefit of this is that the weights, once transformed to their logarithms, are added together, not multiplied. This allows us to use traditional implementations of graph algorithms unchanged (we just give them log(weight) instead of weight). If we try to give them something that would normally be between 0 and 1, we see that log(x) actually becomes negative, exposing the true nature of that edge and the potential negative cycles it may create.
Summary
You'll probably want to use Bellman-Ford, and you should be fine just replacing addition with multiplication. If you have an existing implementation at hand that uses addition to combine edge weights, you can easily cheat by passing it the log() of each edge weight instead (negated, if the implementation minimizes total weight, so that minimizing the sum maximizes the product), and things will work "automagically".
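To make that concrete, here is a minimal sketch (in Python, since the thread is language-agnostic) of Bellman-Ford run on -log(rate) weights; the edge-list format and the sample rates are made up for illustration:

import math

def best_rate(edges, src, dst):
    # edges: list of (from_currency, to_currency, rate) tuples (hypothetical format).
    # Bellman-Ford on weights -log(rate): minimizing the sum of weights
    # maximizes the product of rates along the path.
    nodes = {c for u, v, _ in edges for c in (u, v)}
    dist = {c: math.inf for c in nodes}
    dist[src] = 0.0
    for _ in range(len(nodes) - 1):
        for u, v, rate in edges:
            w = -math.log(rate)
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # One extra pass: any further improvement means a negative cycle, i.e. arbitrage.
    for u, v, rate in edges:
        if dist[u] - math.log(rate) < dist[v] - 1e-12:
            raise ValueError("arbitrage cycle detected; the best rate is unbounded")
    return math.exp(-dist[dst])

# Example with made-up rates: the direct USD->INR edge beats going via EUR.
rates = [("USD", "INR", 82.0), ("INR", "USD", 0.0118),
         ("USD", "EUR", 0.92), ("EUR", "INR", 89.0)]
print(best_rate(rates, "USD", "INR"))  # ~82.0 (the direct conversion)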

Counting isomeric n-carbon aliphatic alkanes

An n-carbon aliphatic alkane is an unrooted tree consisting of n nodes where the degree of each node is at most 4. As an example, see this for a list of the enumeration of some low values of n.
I am looking for an algorithm to compute the number of such n-carbon aliphatic alkanes, given an n.
I have seen this on Chemistry Stack Exchange already. I have also thought of dynamic programming, i.e., building larger graphs from smaller components, but I cannot deal with overcounting the same isomers.
Clarification: the carbons are just a metaphor. I do not wish to take into account the instability of C16 and C17, nor do I care about stereoisomers.
So the standard approach is to use the Redfield–Pólya theorem, also known as the Pólya enumeration theorem. However, it is not very 'algorithmic'; you end up with code like this (the Mathematica, Haskell, or one of the Python versions).
The rosettacode page also describes a more direct approach using canonical checking to avoid duplicates. The algorithm is a specialised form of orderly generation (I think) that only works for trees without vertex or edge colours and with a maximum valence of 4.
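For small n you can also cross-check any implementation by brute force: enumerate the non-isomorphic trees on n nodes and keep those with maximum degree at most 4. A sketch using networkx (assuming networkx is acceptable; this enumeration grows exponentially, so it is only practical for small n):

import networkx as nx

def count_alkanes(n):
    # Unrooted trees on n nodes with maximum degree <= 4, i.e. the
    # constitutional isomers of the n-carbon aliphatic alkane.
    if n < 4:
        return 1  # methane, ethane and propane each have a single isomer
    return sum(1 for t in nx.nonisomorphic_trees(n)
               if max(d for _, d in t.degree()) <= 4)

print([count_alkanes(n) for n in range(1, 9)])  # expect 1, 1, 1, 2, 3, 5, 9, 18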

maxmin clustering algorithm

I read a paper that mentions a max-min clustering algorithm, but I don't quite understand what this algorithm does. Googling "max min clustering algorithm" doesn't yield any helpful results. Does anybody know what this algorithm means? This is an excerpt from the paper:
Max-min clustering proceeds by choosing an observation at random as the first centroid c1, and by setting the set C of centroids to {c1}. During the ith iteration, ci is chosen such that it maximizes the minimum Euclidean distance between ci and observations in C. Max-min clustering is preferable to a density-based clustering algorithm (e.g. k-means) which would tend to select many examples from the dense group of non-seizure data points.
I don't quite understand the bolded part.
The link to the paper is here.
We choose each new centroid to be as far as possible from the existing centroids. Here's some Python code.
import math

# Euclidean distance between two observations given as coordinate tuples
# (the snippet assumes some distance() function is defined; this is one choice).
def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def maxminclustering(observations, k):
    observations = set(observations)
    if k < 1 or not observations:
        return set()
    # Start from an arbitrary observation as the first centroid.
    centroids = set([observations.pop()])
    for i in range(min(k - 1, len(observations))):
        # Pick the observation whose distance to its nearest centroid is largest.
        newcentroid = max(observations,
                          key=lambda observation:
                              min(distance(observation, centroid)
                                  for centroid in centroids))
        observations.remove(newcentroid)
        centroids.add(newcentroid)
    return centroids
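For example, with some made-up 2-D points (the exact result can vary because the first centroid is whatever pop() returns):

points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (-5, -5)]
print(maxminclustering(points, 3))
# e.g. {(0, 0), (10, 11), (-5, -5)} - three well-spread points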
This sounds a lot like the farthest-points heuristic for seeding k-means, but then not performing any k-means iterations at all.
This is a surprisingly simple, but quite effective strategy. Basically it will find a number of data points that are well spread out, which can make k-means converge fast. Usually, one would discard the first (random) data point.
It only works well for low values of k though (it avoids placing centroids in the center of the data set!), and it is not very favorable to multiple runs - it tends to choose the same initial centroids again.
K-means++ can be seen as a more randomized version of this. Instead of always choosing the farthest object, it chooses far objects with increased likelihood, but may at random also choose a near neighbor. This way, you get more diverse results when running it multiple times.
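For comparison, a short sketch of the k-means++ seeding step (numpy; only the seeding, not the k-means iterations, and the data is assumed to be an (n, d) array):

import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    # Pick the first centroid uniformly at random, then pick each further
    # centroid with probability proportional to its squared distance to the
    # nearest centroid chosen so far.
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)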
You can try it out in ELKI, it is named FarthestPointsInitialMeans. If you choose the algorithm SingleAssignmentKMeans, then it will not perform k-means iterations, but only do the initial assignment. That will probably give you this "MaxMin clustering" algorithm.

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate in running competitions of 10 km and 20 km with hilly courses, i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
        Comp1   Comp2   Comp3
User1   20min   ??      10min
User2   25min   20min   12min
User3   30min   25min   ??
User4   30min   ??      ??
I would like to complete the matrix above which has the size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of the distributions). Moreover, the correlations between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets you most of the way to the solution. I'll codify your intuition that the combination of a user's ability and a course's difficulty yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users, so that u_i is user i's speed. Let the vector v denote the difficulty of the courses, so that v_j is course j's difficulty. Also, when available, let t_ij be user i's time on course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the prediction error on the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It minimizes the above cost function by gradient descent. The gradients of f with respect to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a Conjugate Gradient solver or BFGS optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
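A minimal sketch of this fit in Python/numpy (the variable names and the choice of scipy.optimize.minimize with L-BFGS are mine, not from the text):

import numpy as np
from scipy.optimize import minimize

def fit_user_course(Y, observed):
    # Y: (n_users, n_courses) matrix of observed speeds y_ij = 1/t_ij, with
    # zeros at unobserved entries; observed: boolean mask of known entries.
    n, m = Y.shape

    def loss_and_grad(x):
        u, v = x[:n], x[n:]
        r = observed * (np.outer(u, v) - Y)   # residuals on observed entries only
        grad_u = 2 * r @ v                    # df/du_i = 2 * sum_j r_ij * v_j
        grad_v = 2 * r.T @ u                  # df/dv_j = 2 * sum_i r_ij * u_i
        return (r ** 2).sum(), np.concatenate([grad_u, grad_v])

    x0 = np.full(n + m, 0.1)                  # small nonzero start to break symmetry
    res = minimize(loss_and_grad, x0, jac=True, method="L-BFGS-B")
    u, v = res.x[:n], res.x[n:]
    return np.outer(u, v)                     # predicted speeds; predicted times are their reciprocals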
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been proposed. The resulting algorithms are just as simple to code up and seem to work very well. Check out, for example, Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this; perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step, normalize your data to zero mean and unit standard deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the standard deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is, your loss function is:
total_error = 0
for all i, j:
    if (Participant i participated in Race j):
        actual = ActualRaceResult(i, j)
        predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
        total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term in the loss function, for example the squared lengths of the feature vectors.)
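For concreteness, a rough stochastic-gradient sketch of this architecture (numpy; N, the learning rate, the regularization weight, and the input format are arbitrary placeholder choices):

import numpy as np

def factorize(results, P, R, N=5, lr=0.01, reg=0.05, epochs=200, seed=0):
    # results: list of (participant_index, race_index, normalized_result) triples.
    rng = np.random.default_rng(seed)
    pf = rng.normal(scale=0.1, size=(P, N))   # participant feature vectors
    rf = rng.normal(scale=0.1, size=(R, N))   # race feature vectors
    for _ in range(epochs):
        for i, j, actual in results:
            err = actual - pf[i] @ rf[j]      # prediction = scalar product
            pf[i] += lr * (err * rf[j] - reg * pf[i])
            rf[j] += lr * (err * pf[i] - reg * rf[j])
    return pf, rf                             # predict a missing entry with pf[i] @ rf[j]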
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical missing-data recovery task. There are several different methods; one that I can suggest is based on the Self-Organizing Feature Map (Kohonen's map).
Below it is assumed that every athlete's record is a pattern and every competition is a feature.
Basically, you should divide your data into two sets: the first with fully defined patterns, and the second with patterns that have partially missing features. I assume this is feasible because the sparsity is 8%, i.e. you have enough data (92%) to train the net on undamaged records.
Then you feed the first set to the SOM and train it on this data. During this process all features are used. I won't copy the algorithm here, because it can be found in many public sources, and some implementations are even available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate the best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU the weights corresponding to the missing features.
Alternatively, you could skip dividing the data into two sets and train the net on all patterns, including the ones with missing features. But for such patterns the learning process should be altered in a similar way, i.e. the BMU should be calculated only on the features that exist in each pattern.
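A compact sketch of the masked-BMU idea (plain numpy rather than a library SOM; the grid size, learning rate, and neighbourhood width are arbitrary choices):

import numpy as np

def train_som(X, grid=(10, 10), epochs=50, lr=0.5, sigma=2.0, seed=0):
    # X: (n_patterns, n_features) complete training data.
    # Returns SOM weights of shape (grid_h * grid_w, n_features).
    rng = np.random.default_rng(seed)
    h, w = grid
    units = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
    W = rng.normal(size=(h * w, X.shape[1]))
    for epoch in range(epochs):
        decay = np.exp(-epoch / epochs)
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))
            d2 = ((units - units[bmu]) ** 2).sum(axis=1)   # grid distance to the BMU
            nb = np.exp(-d2 / (2 * (sigma * decay) ** 2))  # neighbourhood function
            W += (lr * decay) * nb[:, None] * (x - W)
    return W

def impute_with_som(W, x, known):
    # x: one pattern; known: boolean mask of its observed features.
    # Find the BMU using only the observed features, then copy the BMU's
    # weights into the missing positions.
    bmu = np.argmin(((W[:, known] - x[known]) ** 2).sum(axis=1))
    filled = x.astype(float)
    filled[~known] = W[bmu, ~known]
    return filled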
I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This minimizes the rank of your matrix M. P in the constraint is an operator that takes the known terms of your matrix M' and constrains those terms in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex envelope ||M||_*. You then trade off the two terms by controlling the parameter lambda.
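A minimal sketch of the relaxed problem using cvxpy (assuming cvxpy and a suitable solver are installed; the NaN-for-missing convention is my own):

import numpy as np
import cvxpy as cp

def complete(M_obs, lam=10.0):
    # M_obs: matrix with np.nan at the missing entries; lam trades off the
    # nuclear-norm term against the data-fit term.
    known = (~np.isnan(M_obs)).astype(float)
    M0 = np.nan_to_num(M_obs)                   # zeros at the missing entries
    M = cp.Variable(M_obs.shape)
    data_fit = cp.norm(cp.multiply(known, M - M0), "fro")
    cp.Problem(cp.Minimize(cp.normNuc(M) + lam * data_fit)).solve()
    return M.value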

Algorithm for nearest point

I've got a list of ~5000 points (specified as longitude/latitude pairs), and I want to find the nearest 5 of these to another point, specified by the user.
Can anyone suggest an efficient algorithm for working this out? I'm implementing this in Ruby, so if there's a suitable library then that would be good to know, but I'm still interested in the algorithm!
UPDATE: A couple of people have asked for more specific details on the problem. So here goes:
The 5000 points are mostly within the same city. There might be a few outside it, but it's safe to assume that 99% of them lie within a 75km radius, and that all of them lie within a 200km radius.
The list of points changes rarely. For the sake of argument, let's say it gets updated once per day, and we have to deal with a few thousand requests in that time.
You could accelerate the search by partitioning the 2D space with a quad-tree or a k-d tree, and then, once you've reached a leaf node, compare the remaining distances one by one until you find the closest match.
See also this blog post, which refers to this other blog post; both discuss nearest-neighbour searches with k-d trees in Ruby.
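The question asks about Ruby, but for illustration here is the same idea sketched with scipy's k-d tree in Python, after projecting to approximately planar coordinates (the (lat, lon) input format is an assumption):

import numpy as np
from scipy.spatial import cKDTree

def five_nearest_kdtree(points, query):
    # points: (n, 2) array-like of (lat, lon) in degrees; query: one (lat, lon) pair.
    # Scale longitude by cos(latitude) so one unit means roughly the same ground
    # distance on both axes; good enough over a city-sized area.
    pts = np.radians(np.asarray(points, dtype=float))
    q = np.radians(np.asarray(query, dtype=float))
    scale = np.cos(q[0])
    tree = cKDTree(np.column_stack([pts[:, 0], pts[:, 1] * scale]))
    _, idx = tree.query([q[0], q[1] * scale], k=5)
    return idx   # indices of the 5 nearest points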
You can get a very fast upper-bound estimator on distance using Manhattan distance (scaled for latitude); this should be good enough for rejecting 99.9% of candidates if they're not close. (EDIT: since then you've told us they are close. In that case, your metric should be distance squared, as per Lars H's comment.)
Consider this equivalent to rejecting anything outside a spherical-rectangle bounding-box (as an approximation to a circle bounding-box).
I don't do Ruby, so here is the algorithm in pseudocode:
Let the latitude and longitude of your reference point P be (pa, po) and of the other point X be (xa, xo).
Precompute ka, the latitude scaling factor for longitudinal distances: ka = cos(pa in degrees). (Strictly, treating ka as a constant is a linearized approximation in the vicinity of P.)
Then the distance estimator is: D(X,P) = |xa-pa| + ka*|xo-po| = da + ka*do
where |z| means abs(z). At worst this overestimates the true distance by a factor of √2 (when the two terms are equal), hence we allow for that as follows:
Do a running search and keep Dmin, the fifth-smallest scaled-Manhattan-distance-estimate.
Hence you can reject upfront all points for which D(X,P) > √2 * Dmin (since their true distance √(da² + (ka*do)²) is at least D(X,P)/√2, which exceeds Dmin - that should eliminate 99.9% of the points).
Keep a list of all remaining candidate points with D(X,P) <= √2 * Dmin. Update Dmin if you find a new fifth-smallest D. A priority queue, or else a list of (coord, D) pairs, is a good data structure.
Note that we never computed Euclidean distance, we only used float multiplication and addition.
(Consider this similar to a quadtree, except that we filter out everything outside the region that interests us, so there is no need to compute accurate distances upfront or build the data structure.)
It would help if you told us the expected spread in latitudes and longitudes (degrees, minutes, or what?). If all the points are close, the √2 factor in this estimator will be too conservative and mark every point as a candidate; a lookup-table-based distance estimator would be preferable.
Pseudocode:
initialize Dmin with the fifth-smallest D among the first five points in the list
for point X in list:
    if D(X,P) <= √2 * Dmin:
        insert the tuple (X, D) into the priority queue of candidates
        if (Dmin > D): Dmin = D
# after the first pass, reject candidates with D > √2 * Dmin (use the final value of Dmin)
# ...
# then make a second pass over the candidates to find the lowest 5 exact distances
Since your list is quite short, I'd highly recommend brute force. Just compare all 5000 to the user-specified point. It'll be O(n) and you'll get paid.
Other than that, a quad-tree or k-d tree is the usual approach to spatial subdivision. But in your case, you'll end up doing a linear number of insertions into the tree, and then a constant number of logarithmic lookups... a bit of a waste, when you're probably better off just doing a linear number of distance comparisons and being done with it.
Now, if you want to find the N nearest points, you're looking at sorting on the computed distances and taking the first N, but that's still O(n log n)ish.
EDIT: It's worth noting that building the spatial tree becomes worthwhile if you're going to reuse the list of points for multiple queries.
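A brute-force sketch in Python (the question is about Ruby, but the idea carries over directly); it uses the haversine formula for the exact great-circle distance and heapq.nsmallest to grab the five closest:

import heapq
import math

def haversine_km(p, q):
    # Great-circle distance in km between two (lat, lon) pairs in degrees.
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def five_nearest_bruteforce(points, query):
    # points: list of (lat, lon) pairs; O(n) distance computations per query.
    return heapq.nsmallest(5, points, key=lambda p: haversine_km(p, query))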
Rather than pure brute force, for 5000 nodes I would calculate the individual x+y distance for every node instead of the straight-line distance.
Once you've sorted that list, if e.g. x+y for the 5th node is 38, you can rule out any node where either the x or the y distance is > 38. This way, you can rule out a lot of nodes without having to calculate the straight-line distance. Then brute-force calculate the straight-line distance for the remaining nodes.
These algorithms are not easily explained, thus I will only give you some hints in the right direction. You should look for Voronoi Diagrams. With a Voronoi Diagram you can easily precompute a graph in O(n^2 log n) time and search the closest point in O(log n) time.
Precomputation is done with a cron job at night and searching is live. This corresponds to your specification.
Now you could save the k closest neighbours of each of your 5000 points, then start from the nearest point found via the Voronoi diagram and search for the remaining 4 points.
But be warned that these algorithms are not very easy to implement.
A good reference is:
de Berg et al.: Computational Geometry: Algorithms and Applications (2008), chapters 7.1 and 7.2
Since you have that few points, I would recommend doing a brute-force search, to the effect of trying all points against each other, which is an O(n^2) operation; with n = 5000, that's roughly 12.5 million iterations of a suitable algorithm, and you just store the relevant results. This would have sub-100 ms execution time in C, so we are looking at a second or two at the most in Ruby.
When the user picks a point, you can use your stored data to give the results in constant time.
EDIT: I re-read your question, and it seems the user provides their own query point. In that case it's faster to just do an O(n) linear search through your set each time the user provides a point.
If you need to repeat this multiple times, with different user-entered locations, but don't want to implement a quad-tree (or can't find a library implementation), then you can use a locality-sensitive-hashing (kind of) approach that's fairly intuitive:
take your (x,y) pairs and create two lists, one of (x, i) and one of (y, i) where i is the index of the point
sort both lists
then, when given a point (X, Y),
binary-search (bisect) for X and Y in the two sorted lists
expand outwards on both lists, looking for common indices
for common indices, calculate exact distances
stop expanding when the differences in X and Y exceed the exact distance of the most-distant of the current 5 points.
All you're doing is saying that a nearby point must have a similar x and a similar y value...

Resources