Is it possible to use NDepend to find class partitioning in an OOP language? - refactoring

Fort first let me better explain the context and the problem.
Context
We have a big class with dozen of methods; it violates many Software Engineering principles and this is clearly visible through a code metrics measure tool.
See poor cohesion, too many method, and so on.
The problem
When we try to split the class into smaller classes the problem arise with the instance methods. They need to access one or more field / property from the class and it may be useful to put all together the methods which access a specific field / property. The issue is clearly visible when trying to move a bunch of methods to a new class with Resharper → Refactor → Extract Class.
Is it possible to partition the methods and fields which are connected together (have an high cohesion to use code metric terminology)?

Certainly we can do something interesting here, but this requires a partition algorithm and I am not sure which one to choose.
The query could look like:
// Replace "NHibernate.Cfg.Configuration" with your "Namespace.TypeName"
let type = Application.Types.WithFullName("NHibernate.Cfg.Configuration").Single()
let dicoFields = type.Fields
.ToDictionary(f => f, f => f.MethodsUsingMe.Where(m => m.ParentType == f.ParentType))
let dicoMethods = type.Methods
.ToDictionary(m => m, m => m.FieldsUsed.Where(f => f.ParentType == m.ParentType))
// The partition algorithm on both dicos here
from pair in dicoFields
orderby pair.Value.Count() descending
select new { pair.Key, pair.Value }
//from pair in dicoMethods
//orderby pair.Value.Count() descending
//select new { pair.Key, pair.Value}
Does this help moving forward?

Starting from the answer of #Patrick from NDepend team I've made some further investigation arriving to a some theory and some partially solving example.
First, to find also Property used by Methods you should replace FieldsUsed with MembersUsed in the query.
I've wrote two samples:
In the first the alghoritm is working as expected (the partition are not overlapping)
In the second there is an issue (there should be no elements in common)
To understand this partial solution you can look here and here :
I've tried to ask for some advice here, and I found out that this is an optimization problem called Steiner forest which involves oriented connected Graph.
Partitioning a graph into connected subgraphs with sets of vertices that must be in the same subgraph
See also:
Component (graph theory) on Wikipedia
Partitioning a graph into connected subgraphs with sets of vertices that must be in the same subgraph
CommunityGraphPlot

Related

What's the difference between effect and control edges of V8's TurboFan?

I've read many blog posts, articles, presentation and videos, even inspected V8's source code, both the bytecode generator, the sea-of-nodes graph generator and the optimization phases, and still couldn't find an answer.
V8's optimizing compiler, TurboFan, uses an IR of type "sea of nodes". All of the academic articles I found about it says that it's basically a CFG combined with a data-flow graph, and as such has two type of edges to connect nodes: data edges and control edges. Basically, if you take only the data edges you form a data-flow graph while if you choose the control edges you get a control flow graph.
However, TurboFan has one more edge type: "effect edges" (and effect phis). I suppose that this is what this slide means when it says that this is not "sea" of nodes but "soup" of nodes, because I couldn't find this term anywhere else. From what I understood, effect edges help the compiler keep the structure of statements/expressions that if reordered will have a visible side-effect. The example everyone uses is o.f = o.f + 1: the load has to come before the store (or we'll read the new value), and the addition has to come before the store, too (or otherwise we'll store the old value and uselessly increment the result).
But I cannot understand: isn't this the goal of control edges? From searching through the code I can see that almost every node has an effect edge and a control edge. Their uses isn't obvious though. From what I understand, in sea of nodes you use control edges to constrain evaluation order. So why do we need both effect and control edges? Probably I'm missing something fundamental, but I can't find it.
TL;DR: What's the use of effect edges and EffectPhi nodes, and how they're different from control edges.
Great thanks.
The idea of a sea-of-nodes compiler is that IR nodes have maximum freedom to move around. That's why in a snippet like o.f = o.f + 1, the sequence load-add-store is not considered "control flow". Only conditions and branches are. So if we slightly extend the example:
if (condition) {
o.f = o.f + 1;
} else {
// something else
}
then, as you describe in the question, effect dependencies ensure that load-add-store are scheduled in this order, and control dependencies ensure that all three operations are only performed if condition is true (technically, if the true-branch of the if-condition is taken). Note that this is important even for the load; for instance, o might be an invalid object if condition is false, and attempting to load its f property might cause a segfault in that case.

Algorithms and the Strategy Design Pattern? Breaking a process into codependent steps

I am designing a factory that creates objects of Maze type. The factory requires three distinct steps to create the mazes and one optional step.
partition (split) the 2d or 3d closed space into Partitions (rooms). This step is called Partitioning.
Check which Partitions' surfaces overlap sufficiently (which rooms share a wall) so that it's possible to (add a door between them and) connect them. This step I call Graphing.
Run a Spanning Tree or a similar algorithm that connects all Partitions as a Tree or a similar connected Graph structure. I call this Linkning.
Optional*: Create the actual doors between Partitions.
I have a class called MazeGenerator that accepts a Partitioning(1), Graphing(2), Linking(3) and possibly a DoorGeneration(4) method.
Alright, the issue is with flexibility and usability:
Some methods of spliting the space, are very rigid (for instance spliting it into a grid) and allow us to use a much faster method in step [2], we may not have to do anything actually cause we know exactly which walls are shared by which rooms and can continue to step [3].
On the other hand, some methods for partitioning are very random and require a lot more work on step [2]. I want to allow users (other devs) to apply the fast method when it is viable but to constraint and convey that they need to use another method when they split the space with a more random method.
The issue is, how do I convey or constraint people that use this API?
So far I thought about adding a protected const array to each method step class called viableNextStep that lists all the classes that describe possible methods that can be used in conjunction with this method.
So for instance, PartitionIntoRigidGrid2D would have
Array viableMethods = [SimpleGridGraph2D, findOverlap2D] // Hope that makes sense

An understandable clusterization

I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts clusterize datasets such as mine using their 'expert knowledge', but I want to automate this process of clusterization.
Most algorithms for clusterization use distance (Euclidean, Mahalanobdis and so on) between objects to group them in clusters. But it is hard to find some reasonable metrics for mixed data types, i.e. we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. It is reasonable for categorical variables, and it works fine with numeric variables assuming they are distributed normally.
So it became clear to me that algorithms like K-means will not produce good results.
At this time I try to work with COBWEB algorithm, that fully matches my ideas of using conditional probabilities. But I faced another obsacles: results of clusterization are really hard to interpret, if not impossible. As a result I wanted to get something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], it is cluster1), like descision trees for classification.
So, my question is:
Is there any existing clusterization algorithm that works with mixed data type and produces an understandable (and reasonable for humans) description of clusters.
Additional info:
As I understand my task is in the field of conceptual clustering. I can't define a similarity function as it was suggested (it as an ultimate goal of the whoal project), because of the field of study - it is very complicated and mercyless in terms of formalization. As far as I understand the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it, so I can get an undestandable description of clusters.
Decision Tree
As it was suggested, I tried to train a decision tree on the clustering output, thus getting a description of clusters as a set of rules. But unfortunately interpretation of this rules is almost as hard as with the raw clustering output. First of only a few first levels of rules from the root node do make any sense: closer to the leaf - less sense we have. Secondly, these rules doesn't match any expert knowledge.
So, I came to the conclusion that clustering is a black-box, and it worth not trying to interpret its results.
Also
I had an interesting idea to modify a 'decision tree for regression' algorithm in a certain way: istead of calculating an intra-group variance calcualte a category utility function and use it as a split criterion. As a result we should have a decision tree with leafs-clusters and clusters description out of the box. But I haven't tried to do so, and I am not sure about accuracy and everything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. satisfy triangle inequality).
K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a different distance function than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions, say you use the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorial similarity function.
Once you have a decent similarity function, a whole bunch of algorithms will become available to you. e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you, they have a Tutorial on writing custom distance functions.
Interpretation is a separate thing. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. But of course you could next train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering. Because the one really nice feature about decision trees, is that they are somewhat human understandable. But just like a Support Vector Machine will not give you an explanation, most (if not all) clustering algorithms will not do that either, sorry, unless you do this kind of post-processing. Plus, it will actually work with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.
There was a related publication last year. It is a bit obscure and experimental (on a workshop at ECML-PKDD), and requires the data set to have a quite extensive ground truth in form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to e.g. say "this cluster found is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well, you need to investigate it closer - maybe the algorithm discovered something new here". But it was very experimental (Workshops are for work-in-progress type of research). You might be able to use this, by just using your features as ground truth. It should then detect if a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not forcibly create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to solve this type of clustering problem is to define a statistical model that captures relevant characteristics of your data. Cluster assignments can be derived by using a mixture model (as in the Gaussian Mixture Model) then finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a vector of two dimensions. The first dimension is a normally distributed variable and the second is a choice of five categories (see graph):
There are a number of frameworks that are available to run monte carlo inference for statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGs and is more efficient for many problems:
data {
int<lower=0> N; //number of data points
int<lower=0> C; //number of categories
real x[N]; // normally distributed component data
int y[N]; // categorical component data
}
parameters {
real<lower=0,upper=1> theta; // mixture probability
real mu[2]; // means for the normal component
simplex[C] phi[2]; // categorical distributions for the categorical component
}
transformed parameters {
real log_theta;
real log_one_minus_theta;
vector[C] log_phi[2];
vector[C] alpha;
log_theta <- log(theta);
log_one_minus_theta <- log(1.0 - theta);
for( c in 1:C)
alpha[c] <- .5;
for( k in 1:2)
for( c in 1:C)
log_phi[k,c] <- log(phi[k,c]);
}
model {
theta ~ uniform(0,1); // equivalently, ~ beta(1,1);
for (k in 1:2){
mu[k] ~ normal(0,10);
phi[k] ~ dirichlet(alpha);
}
for (n in 1:N) {
lp__ <- lp__ + log_sum_exp(log_theta + normal_log(x[n],mu[1],1) + log_phi[1,y[n]],
log_one_minus_theta + normal_log(x[n],mu[2],1) + log_phi[2,y[n]]);
}
}
I compiled and ran the Stan model and used the parameters from the final sample to compute the probability of each datapoint under each mixture component. I then assigned each datapoint to the mixture component (cluster) with higher probability to recover the cluster assignments below:
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
For heterogenous, non-Euclidean data vectors as you describe, hierarchical clustering algorithms often work best. The conditional probability condition you describe can be incorporated as an ordering of attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.

How to implement this situation (pointA - pointB)?

Currently I'm implementing one transport service which offers collective trips, and i'm stuck in one problem:
Lets say I've got points G = {A,B,C,D,F,R,W} => in the picture below.
When the user selects from(A) -> to(W) there're points between them: {C,F,R}, i want to offer just the points which are connected with each others, like A->C, C->F.... and the other points shouldn't be visible in the select list. Any help any tips 'd be great, Thanks!
what you're asking is a path finding algorithm like A* ?

Need some help understanding this problem about maximizing graph connectivity

I was wondering if someone could help me understand this problem. I prepared a small diagram because it is much easier to explain it visually.
alt text http://img179.imageshack.us/img179/4315/pon.jpg
Problem I am trying to solve:
1. Constructing the dependency graph
Given the connectivity of the graph and a metric that determines how well a node depends on the other, order the dependencies. For instance, I could put in a few rules saying that
node 3 depends on node 4
node 2 depends on node 3
node 3 depends on node 5
But because the final rule is not "valuable" (again based on the same metric), I will not add the rule to my system.
2. Execute the request order
Once I built a dependency graph, execute the list in an order that maximizes the final connectivity. I am not sure if this is a really a problem but I somehow have a feeling that there might exist more than one order in which case, it is required to choose the best order.
First and foremost, I am wondering if I constructed the problem correctly and if I should be aware of any corner cases. Secondly, is there a closely related algorithm that I can look at? Currently, I am thinking of something like Feedback Arc Set or the Secretary Problem but I am a little confused at the moment. Any suggestions?
PS: I am a little confused about the problem myself so please don't flame on me for that. If any clarifications are needed, I will try to update the question.
It looks like you are trying to determine an ordering on requests you send to nodes with dependencies (or "partial ordering" for google) between nodes.
If you google "partial order dependency graph", you get a link to here, which should give you enough information to figure out a good solution.
In general, you want to sort the nodes in such a way that nodes come after their dependencies; AKA topological sort.
I'm a bit confused by your ordering constraints vs. the graphs that you picture: nothing matches up. That said, it sounds like you have soft ordering constraints (A should come before B, but doesn't have to) with costs for violating the constraint. An optimal algorithm for scheduling that is NP-hard, but I bet you could get a pretty good schedule using a DFS biased towards large-weight edges, then deleting all the back edges.
If you know in advance the dependencies of each node, you can easily build layers.
It's amusing, but I faced the very same problem when organizing... the compilation of the different modules of my application :)
The idea is simple:
def buildLayers(nodes):
layers = []
n = nodes[:] # copy the list
while not len(n) == 0:
layer = _buildRec(layers, n)
if len(layer) == 0: raise RuntimeError('Cyclic Dependency')
for l in layer: n.remove(l)
layers.append(layer)
return layers
def _buildRec(layers, nodes):
"""Build the next layer by selecting nodes whose dependencies
already appear in `layers`
"""
result = []
for n in nodes:
if n.dependencies in flatten(layers): result.append(n) # not truly python
return result
Then you can pop the layers one at a time, and each time you'll be able to send the request to each of the nodes of this layer in parallel.
If you keep a set of the already selected nodes and the dependencies are also represented as a set the check is more efficient. Other implementations would use event propagations to avoid all those nested loops...
Notice in the worst case you have O(n3), but I only had some thirty components and there are not THAT related :p

Resources