Traditional software metrics deal with the quality of software. I'm looking for metrics that can be used to identify developers by their code, in the same vein as plagiarism software and stylometry can be used to identify authors by their writing style. I can imagine that certain existing metrics can be used here as well, such as comment ratio. I can also imagine metrics that would be irrelevant from a quality point of view, such as the (over)use of certain methods or design patterns, average length of variable names, etc.
I'm interested either in a pointer to a collection of such metrics or studies, or individual metrics. They may be language-agnostic or related to a language or programming paradigm.
I want to use it to understand and analyze different coding styles, not to detect plagiarism.
I see there are already a couple of studies that looked into this. They might help.
Kothari, J., Shevertalov, M., Stehle, E., Mancoridis, S., "A probabilistic approach to source code authorship identification", In Proceedings of the International Conference on Information Technology, pp.243-248, IEEE, 2007.
Available online here
Quoting from the abstract:
We begin by computing a set of metrics to build profiles for a population of known authors using code samples that are verified to be authentic. We then compute metrics on unidentified source code to determine the closest matching profile. [...] In our case study we are able to determine authorship with greater than 70% accuracy in choosing the single nearest match and greater than 90% accuracy in choosing the top three ordered nearest matches.
Shevertalov, M., Kothari, J., Stehle, E., Mancoridis, S., "On the use of discretized source code metrics for author identification", In Proceedings of the 1st International Symposium on Search Based Software Engineering, pp.69-78, IEEE, 2009.
Available online here. This is a follow-up to the previous study.
Lange, R., Mancoridis, S., "Using code metric histograms and genetic algorithms to perform author identification for software forensics", In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp.2082-2089, ACM, 2007.
Available online here
This is also related to the first reference (common author), and discusses the metrics in more detail. Again quoting from the abstract:
Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics.
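To make the histogram idea concrete, here is a minimal sketch in Python. It is not the method from the paper: the single metric (line length), the bin width, and the distance measure (sum of absolute differences) are all my own assumptions, chosen only to show what "comparing histogram distributions of a code metric" can look like.

```python
from collections import Counter

def line_length_histogram(source: str, bin_width: int = 10) -> dict:
    """Normalized histogram of non-blank line lengths, bucketed by bin_width."""
    lengths = [len(ln) for ln in source.splitlines() if ln.strip()]
    counts = Counter(n // bin_width for n in lengths)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()} if total else {}

def histogram_distance(h1: dict, h2: dict) -> float:
    """Sum of absolute differences between two normalized histograms (0 to 2)."""
    bins = set(h1) | set(h2)
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in bins)

# Two made-up code samples standing in for two authors.
author_a = "x = 1\ny = 2\nz = x + y\n"
author_b = "total_of_all_items = compute_total(items, tax_rate)\n"

hist_a = line_length_histogram(author_a)
hist_b = line_length_histogram(author_b)
d_same = histogram_distance(hist_a, hist_a)  # identical profiles
d_diff = histogram_distance(hist_a, hist_b)  # no overlapping bins
```

A real study would combine many such histograms (the abstract mentions 18 metrics) and would need far more code per author than these toy samples.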
You can also use Google Scholar for other references, and for finding other papers based on the ones above (using the "cited by" option).
If you're looking for potential metrics, you might try reviewing some coding standards. Since these dictate a particular style, it follows that the things they talk about (spacing, placement of braces, identifier lengths, mandatory comments, etc.) are things that might be used to identify developers from their code.
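As a rough illustration of how such coding-standard items could be turned into numbers, here is a small Python sketch for C-like source text. Everything in it is an assumption on my part: the keyword list is deliberately tiny, comments are detected only at the start of a line, and the metric names are made up.

```python
import re

# Tiny, illustrative keyword list; a real tool would use the language's full list.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "void", "class"}

def style_metrics(source: str) -> dict:
    """Compute a few crude layout and naming metrics for C-like source text."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    is_comment = [ln.lstrip().startswith(("//", "/*", "*")) for ln in lines]
    code_lines = [ln for ln, c in zip(lines, is_comment) if not c]

    # Identifiers: word-like tokens in code lines that are not keywords.
    tokens = re.findall(r"[A-Za-z_]\w*", "\n".join(code_lines))
    identifiers = [t for t in tokens if t not in KEYWORDS]

    # Brace placement: share of opening braces that sit alone on their line.
    open_braces = [ln for ln in code_lines if "{" in ln]
    own_line = [ln for ln in open_braces if ln.strip() == "{"]

    return {
        "comment_ratio": sum(is_comment) / len(lines) if lines else 0.0,
        "avg_identifier_length": (
            sum(len(t) for t in identifiers) / len(identifiers) if identifiers else 0.0
        ),
        "brace_on_own_line": len(own_line) / len(open_braces) if open_braces else 0.0,
    }

sample = """// add two numbers
int add(int a, int b)
{
    return a + b;
}
"""
metrics = style_metrics(sample)
```

For real use you would want a proper lexer for the language in question, since trailing comments and string literals will skew the identifier counts here.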
Also, if you're interested in .NET code, you might find NDepend to be a useful tool. It enables you to run queries against a code base, and supports 82 metrics.
How can we determine how many rules and fuzzy sets we need in our fuzzy system?
Would the system get better if we increased the number of rules and fuzzy sets?
How can we determine how many rules and fuzzy sets we actually need for better results?
Thanks
There are many different methods of determining where you need to model fuzziness in a particular application. The overarching principles to keep in mind are: 1) Look for places where it would be beneficial to treat ordinal or nominal data on a continuous scale, even at the cost of imprecision, and 2) The "fuzz" should be naturally present in the data or problem you're trying to solve; it's not a secret ingredient one adds to make an application better, as is sometimes implied by overeager enthusiasts. Only add fuzzy rules and sets when you can justify the added computational, data collection, or other costs in terms of greater accuracy or some other practical use.
With those principles in mind, here are some ways of detecting places where fuzzy rules and sets might be useful:
• The number one candidate is natural language modeling, perhaps through a Behavior-Driven Development (BDD) process if you're in a software development environment. For example, you can interview people with domain knowledge and look for naturally fuzzy statements, such as "cloudy," "overcast" and "sunny" in meteorology, or fuzzy numbers, like "about half" or "most." Then find membership functions that most accurately match the meaning assigned to those terms. Note that sometimes terms from multiple fuzzy sets can occur together; for example, you might grade the truth of the statement "about half of these days were cloudy," which might require three separate membership functions: one for truth, one for the fuzzy number and a third for the "cloudy" category. Linguistic analysis is the simplest way, since people naturally use fuzzy language every day; be aware, though, that multiple fuzzy sets can actually be combined to model fuzzy logic curiosities that don't often occur in natural language, like “John is taller than he is clever,” “Inventory is higher than it is low,” “Coffee is at least as unhealthy as it is tasty,” and “Her last novel is more political than it is confessional.” Those examples come from p. 16, Bilgic, Taner and Turksen, I.B. August 1994, “Measurement–Theoretic Justification of Connectives in Fuzzy Set Theory,” pp. 289–308 in Fuzzy Sets and Systems, January 1995. Vol. 76, No. 3.
• Another important task is sorting out how to model "linguistic connectives" like fuzzy ANDs and ORs, or crisp conjunctions between fuzzy statements. Some guiding principles have been worked out and are available in such sources as Alsina, C.; Trillas E. and Valverde, L., 1983, “On Some Logical Connectives for Fuzzy Sets Theory," pp. 15-26 in Journal of Mathematical Analysis and Applications. Vol. 93; Dubois, Didier and Prade, Henri, 1985, “A Review of Fuzzy Set Aggregation Connectives," pp. 85-121 in Information Sciences, July-August, 1985. Vol. 36, Nos. 1-2.
• Pooling the opinions of experts (as in an expert system) or the subjective scores of others (as in a movie ratings system). The ratings themselves would constitute one level of fuzziness, while another tier could be added to weight the importance of each expert or other individual's particular score, if they're particularly authoritative.
• Another option is to use neural nets to determine whether or not the addition of various fuzzy rules and sets to your model actually improves accuracy or some other metric related to your end goal.
• Other options include estimating membership functions and the parameters of T-norms and T-conorms (which are used often in fuzzy complements, unions and intersections) with such techniques as regression, Maximum Likelihood Estimation (MLE), Lagrange interpolation, curve fitting and parameter estimation. All of these are discussed in my favorite reference for fuzzy set math, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.
• In the same vein, deciding whether or not to include particular fuzzy rules or sets may depend on whether or not you can find a good fit between its underlying and possibly unknown "actual" membership function and the ones you're testing. Most of the time triangular, trapezoidal or Gaussian functions will suffice, but in some situations distribution testing might be necessary to find just the right distribution function. Empirical Distribution Functions (EDFs) might come in handy here.
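To make the membership-function and connective ideas in the list above more tangible, here is a minimal Python sketch. The triangular shape and the min/max connectives are the standard textbook choices; the "cloudy" and "overcast" sets and their breakpoints over cloud cover in percent are purely hypothetical values I picked for illustration.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership: 0 at a and c, 1 at the peak b (requires a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_and(m1: float, m2: float) -> float:
    """Minimum t-norm, the classic fuzzy intersection."""
    return min(m1, m2)

def fuzzy_or(m1: float, m2: float) -> float:
    """Maximum t-conorm, the classic fuzzy union."""
    return max(m1, m2)

# Hypothetical fuzzy sets over cloud cover in percent.
def cloudy(cover: float) -> float:
    return triangular(cover, 20, 50, 80)

def overcast(cover: float) -> float:
    # The right foot lies beyond 100 only so that the peak sits at 100.
    return triangular(cover, 60, 100, 140)

cover = 70
m_cloudy = cloudy(cover)      # falling side of "cloudy"
m_overcast = overcast(cover)  # rising side of "overcast"
m_both = fuzzy_and(m_cloudy, m_overcast)
m_either = fuzzy_or(m_cloudy, m_overcast)
```

Each extra set or rule means another function like these to define, fit and justify, which is the cost the two principles above ask you to weigh.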
To make a long story short, a lot of different statistical and machine learning techniques can be applied to give ballpark answers to these questions. The key is to always stay within the bounds of the two main principles above and only model things with fuzzy sets when it would serve your practical goals, then leave out the rest. I hope that helps.
I have been learning a lot about using graphs for machine learning by watching Christopher Bishop's videos ( http://videolectures.net/mlss04_bishop_gmvm/ ). I find it very interesting and watched a few others in the same categories (machine learning/graph), but was wondering if anyone had any recommendations for ways of learning more?
My problem is, although the videos gave a great high-level understanding, I don't have many practical skills in it yet. I've read Bishop's book on machine learning/patterns as well as Norvig's AI book, but neither seems to say much specifically about using graphs. With the emergence of search engines and social networking, I would think machine learning on graphs would be popular.
If possible, can anyone suggest a resource to learn from? (I'm new to this field and development is a hobby for me, so I'm sorry in advance if there's a super obvious resource to learn from. I tried Google and university sites.)
Thanks in advance!
First, I would strongly recommend the book Social Network Analysis for Startups by Maksim Tsvetovat and Alexander Kouznetsov. A book like this is a godsend for programmers who need to quickly acquire a basic fluency in a specific discipline (in this case, graph theory) so that they can begin writing code to solve problems in this domain. Both authors are academically trained graph theoreticians, but the intended audience of their book is programmers. Nearly all of the numerous examples presented in the book are in Python using the networkx library.
Second, for the projects you have in mind, two kinds of libraries are very helpful if not indispensable:
• graph analysis: e.g., the excellent networkx (Python) or igraph (Python, R, et al.) are two that I can recommend highly; and
• graph rendering: the excellent Graphviz, which can be used stand-alone from the command line, but more likely you will want to use it as a library; there are Graphviz bindings in all major languages (e.g., for Python there are at least three I know of, though pygraphviz is my preference; for R there is Rgraphviz, which is part of the Bioconductor package suite). Rgraphviz has excellent documentation (see in particular the vignette included with the package).
It is very easy to install and begin experimenting with these libraries, and in particular to use them
• to learn the essential graph theoretic lexicon and units of analysis (e.g., degree sequence distribution, node traversal, graph operators);
• to distinguish critical nodes in a graph (e.g., degree centrality, eigenvector centrality, assortativity); and
• to identify prototype graph substructures (e.g., bipartite structure, triangles, cycles, cliques, clusters, communities, and cores).
The value of using a graph-analysis library to quickly understand these essential elements of graph theory is that for the most part there is a 1:1 mapping between the concepts I just mentioned and functions in the (networkx or igraph) library.
So, e.g., you can quickly generate two random graphs of equal size (node number), render and then view them, then easily calculate, for instance, the average degree or betweenness centrality for both and observe first-hand how changes in the value of those parameters affect the structure of a graph.
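If you want to see what is being computed before reaching for a library, here is a dependency-free Python sketch of that experiment. It is only an illustration under my own assumptions (an Erdos-Renyi style random graph, plain adjacency sets, made-up sizes and probabilities); networkx and igraph provide their own generators and centrality functions, which you would normally use instead.

```python
import random
from itertools import combinations

def random_graph(n: int, p: float, seed: int) -> dict:
    """Random graph on n nodes: each possible edge appears with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def average_degree(adj: dict) -> float:
    """Mean number of neighbors per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def degree_centrality(adj: dict) -> dict:
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

# Two graphs of equal size that differ only in edge probability.
sparse = random_graph(50, 0.05, seed=1)
dense = random_graph(50, 0.30, seed=1)

avg_sparse = average_degree(sparse)
avg_dense = average_degree(dense)
most_central = max(degree_centrality(dense).items(), key=lambda kv: kv[1])
```

Because both graphs use the same seed, every edge of the sparse graph also appears in the dense one, so the difference in average degree comes only from the change in p.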
W/r/t the combination of ML and graph theoretic techniques, here's my limited personal experience. I use ML in my day-to-day work and graph theory less often, but rarely together. This is just an empirical observation limited to my personal experience: I haven't found a problem in which it has seemed natural to combine techniques from these two domains. Most often, graph theoretic analysis is useful in ML's blind spot, which is the lack of a substantial amount of labeled training data; supervised ML techniques depend heavily on such data.
One class of problems that illustrates this point is online fraud detection/prediction. It's almost never possible to gather data (e.g., sets of online transactions attributed to a particular user) that you can with reasonable certainty separate and label as "fraudulent account." If the fraudsters were particularly clever and effective, you will mislabel their accounts as "legitimate"; and for those accounts for which fraud was suspected, the first-level diagnostics (e.g., additional ID verification or an increased waiting period to cash out) are quite often enough to make them cease the further activity that would have allowed a definite classification. Finally, even if you somehow manage to gather a reasonably noise-free data set for training your ML algorithm, it will certainly be seriously unbalanced (i.e., many more "legitimate" than "fraud" data points); this problem can be managed with statistical pre-processing (resampling) and by algorithm tuning (weighting), but it's still a problem that will likely degrade the quality of your results.
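For what the "weighting" remedy can look like in practice, here is a minimal sketch of inverse-frequency class weights. The 95/5 split is invented for illustration, and the formula (samples divided by classes times class count) is one common heuristic, not something specific to any fraud system.

```python
from collections import Counter

def balanced_class_weights(labels: list) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical, heavily unbalanced training labels.
labels = ["legitimate"] * 95 + ["fraud"] * 5
weights = balanced_class_weights(labels)
```

The rare class gets a much larger weight, so each fraud example counts for more during training; as noted above, this helps but does not make the imbalance go away.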
So while I have never been able to successfully use ML techniques for these types of problems, in at least two instances I have used graph theory with some success; in the most recent instance, by applying a model adapted from a project by a group at Carnegie Mellon initially directed at detecting online auction fraud on eBay.
MacArthur Genius Grant recipient and Stanford Professor Daphne Koller co-authored a definitive textbook on Bayesian networks entitled Probabilistic Graphical Models, which contains a rigorous introduction to graph theory as applied to AI. It may not exactly match what you're looking for, but in its field it is very highly regarded.
You can attend free online classes at Stanford for machine learning and artificial intelligence:
https://www.ai-class.com/
http://www.ml-class.org/
The classes are not simply focused on graph theory, but include a broader introduction in the field and they will give you a good idea of how and when you should apply which algorithm. I understand that you've read the introductory books on AI and ML, but I think that the online classes will provide you with a lot of exercises that you can try.
Although this is not an exact match to what you are looking for, textgraphs is a workshop that focuses on the link between graph theory and natural language processing. Here is a link. I believe the workshop also generated this book.
So here is my problem. I have two paragraphs of text and I need to see if they are similar. Not in the sense of string metrics but in meaning. The following two paragraphs are related but I need to find out if they cover the 'same' topic. Any help or direction to solving this problem would be greatly appreciated.
Fossil fuels are fuels formed by natural processes such as anaerobic
decomposition of buried dead organisms. The age of the organisms and
their resulting fossil fuels is typically millions of years, and
sometimes exceeds 650 million years. The fossil fuels, which contain
high percentages of carbon, include coal, petroleum, and natural gas.
Fossil fuels range from volatile materials with low carbon:hydrogen
ratios like methane, to liquid petroleum to nonvolatile materials
composed of almost pure carbon, like anthracite coal. Methane can be
found in hydrocarbon fields, alone, associated with oil, or in the
form of methane clathrates. It is generally accepted that they formed
from the fossilized remains of dead plants by exposure to heat and
pressure in the Earth's crust over millions of years. This biogenic
theory was first introduced by Georg Agricola in 1556 and later by
Mikhail Lomonosov in the 18th century.
Second:
Fossil fuel reforming is a method of producing hydrogen or other
useful products from fossil fuels such as natural gas. This is
achieved in a processing device called a reformer which reacts steam
at high temperature with the fossil fuel. The steam methane reformer
is widely used in industry to make hydrogen. There is also interest in
the development of much smaller units based on similar technology to
produce hydrogen as a feedstock for fuel cells. Small-scale steam
reforming units to supply fuel cells are currently the subject of
research and development, typically involving the reforming of
methanol or natural gas but other fuels are also being considered such
as propane, gasoline, autogas, diesel fuel, and ethanol.
That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.
I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.
In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.
I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.
Hope this helps!
You can also have a look at the Latent Dirichlet Allocation (LDA) model in machine learning. The idea there is to find a low-dimensional representation of each document (or paragraph), simply as a distribution over some 'topics'. The model is trained in an unsupervised fashion using a collection of documents/paragraphs.
If you run LDA on your collection of paragraphs, then by looking at the similarity of the hidden topic vectors, you can find out whether two given paragraphs are related or not.
Of course, the baseline is to not use LDA, and instead use term frequencies (weighted with tf-idf) to measure similarities (the vector space model).
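Here is a small, dependency-free Python sketch of that vector space baseline. The tokenizer, the smoothed idf variant and the three example sentences are my own assumptions for illustration; libraries such as scikit-learn or gensim offer tuned implementations you would use on real data.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list:
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs: list) -> list:
    """Return one sparse tf-idf vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf, so terms present in every document keep a nonzero weight.
        vectors.append({
            term: (count / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "Fossil fuels include coal, petroleum, and natural gas.",
    "Fossil fuel reforming produces hydrogen from natural gas.",
    "The orchestra played a symphony in the concert hall.",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # shares "fossil", "natural", "gas"
sim_unrelated = cosine(vecs[0], vecs[2])  # shares no terms
```

Note the limitation this exposes: "fuels" and "fuel" are treated as different terms, so in practice you would add stemming or lemmatization, and word overlap alone still cannot tell "related" from "same topic".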
One of the things I’ve been thinking about a lot, off and on, is how we can use metrics of some kind to measure change: are we going backwards or not? This is in the context of a large, legacy code base which we are improving. Most of the code is C++ with a C heritage. Some new functions and the GUI are written in C#.
To start with, we could at least be checking if the simple complexity level was changing over time in the code. The difficulty is in having a representation – we can maybe do a 3D surface where a 2D map represents the code and we have a heat-map of color representing complexity with the 3D surface bulging in and out to show change.
Once you can generate some matrices of numbers, there are a ton of math systems around to take care of stuff like this.
Over time, I'd like to have more sophisticated numbers in there but the same visualisation techniques used to represent change.
I like the idea in Crap4j of focusing on the ratio between complexity and number of unit tests covering that code.
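For reference, here is the C.R.A.P. score as I understand the published Crap4j formula, sketched in Python. Note that it combines cyclomatic complexity with test coverage (as a percentage) rather than a raw count of tests; treat the formula as an assumption to check against the Crap4j documentation before relying on it.

```python
def crap_score(complexity: int, coverage_percent: float) -> float:
    """CRAP(m) = comp(m)^2 * (1 - cov(m)/100)^3 + comp(m)."""
    uncovered = 1.0 - coverage_percent / 100.0
    return complexity ** 2 * uncovered ** 3 + complexity

# The same method (complexity 10) at three coverage levels.
untested = crap_score(10, 0)
half_tested = crap_score(10, 50)
fully_tested = crap_score(10, 100)
```

With full coverage the score collapses to the complexity itself, so tracking this number per method over time shows whether new complexity is arriving with or without tests.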
I'd also like to include Uncle Bob's SOLID metrics and some of the Chidamber and Kemerer OO metrics. The hard part is finding tools to generate these for C++. The only option seems to be Krakatau Essential Metrics (I have no objection to paying for tools). My desire to use the CK metrics comes partly from the books Object-Oriented Metrics: Measures of Complexity by Henderson-Sellers and the earlier Object-Oriented Software Metrics.
If we start using a number of these metrics we could end up with ten or so numbers that are varying across time. I'm fairly ignorant of statistics but it seems it could be interesting to track a bunch of such metrics and then pay attention to which ones tend to vary.
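One simple way to find out which metrics tend to vary is to compare their relative spread across snapshots. The sketch below uses the coefficient of variation (standard deviation divided by mean) so that metrics on different scales can be compared; the metric names and values are entirely made up.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values: list) -> float:
    """Population standard deviation relative to the mean (0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / abs(m) if m else 0.0

# Hypothetical metric values, one per snapshot of the code base.
snapshots = {
    "avg_cyclomatic_complexity": [12.0, 12.5, 13.0, 14.5],
    "comment_ratio": [0.20, 0.20, 0.21, 0.20],
    "coupling_between_objects": [8.0, 8.0, 8.0, 8.0],
}

# Metrics ordered from most to least variable.
ranked = sorted(
    snapshots,
    key=lambda name: coefficient_of_variation(snapshots[name]),
    reverse=True,
)
```

This says nothing about direction (better or worse), only about which numbers are moving enough to deserve a closer look.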
Note that a related question is about measuring code quality across a large code base. I'm more interested in measuring the change.
I'd consider using a Kiviat diagram to represent multiple software metric dimensions evolving over time. These diagrams plot multiple data points as a closed polygon around a center point. Visual inspection will show where a particular metric is going up or down, and one ought to be able to compute an overall ratio of area biased by metric value using some heuristic area computation.
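As one possible version of that area computation, the polygon on a Kiviat diagram with equally spaced axes can be summed as a fan of triangles between neighboring axes. The sketch below assumes the metrics have already been normalized to a common 0 to 1 scale; the two snapshots are invented values.

```python
import math

def kiviat_area(values: list) -> float:
    """Area of the polygon whose vertices lie on equally spaced axes at the given radii."""
    n = len(values)
    if n < 3:
        raise ValueError("a Kiviat polygon needs at least three axes")
    angle = 2 * math.pi / n
    # Each pair of neighboring axes spans a triangle of area 0.5 * r1 * r2 * sin(angle).
    return 0.5 * math.sin(angle) * sum(
        values[i] * values[(i + 1) % n] for i in range(n)
    )

# Hypothetical normalized metric values for two snapshots in time.
before = [0.8, 0.6, 0.7, 0.5]
after = [0.8, 0.7, 0.7, 0.6]
area_ratio = kiviat_area(after) / kiviat_area(before)
```

Be aware that the area depends on the order in which the metrics are placed around the diagram, which is one reason to treat it as a heuristic rather than a precise measure.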
You can also have a glance at the NDepend documentation about code metrics. Disclaimer: I am one of the developers of the tool NDepend.
With the Code Rule and Query over LINQ (CQLinq) facility, it is possible to ask for code metric evolution/trending across two different snapshots in time of the code base. For example, there is a default rule proposed: Avoid making complex methods even more complex.
Several metric trending rules are proposed like:
Avoid decreasing code coverage by tests of types
Types that used to be 100% covered but not anymore
and also, since you mentioned Crap4J, the C.R.A.P metric can be written with CQLinq, and the query could be easily tweaked to see the trend in the C.R.A.P metric.
Concerning the visualization of code metrics, NDepend lets you visualize code metric values through an interactive treemap.
There is a fresh approach for this topic.
E.g.
https://github.com/databricks/koalas/pull/840#issuecomment-536949320
See https://softagram.com/docs/visualizing-code-changes/ for more info, or run an image search in a search engine for the two keywords: softagram koalas
Disclaimer: I work for Softagram.
I have been playing around with different data clustering algorithms, working on finding clusters among random data points represented as nodes. I keep reading that data clustering is used for image recognition. I am failing to make the connection: how does clustering data help in recognizing an image or in facial recognition? Can someone explain this?
It's no surprise that clustering is used for pattern recognition at large, and image recognition in particular: clustering is a reducing process, and images in this megapixel era need boiling down... It is also a process which produces categories and that is of course useful.
However, there are many approaches to the use of clustering as a technique for image recognition. One of the reasons for this diversity is that clustering can be applied at different levels, for different purposes: from the basic pixel level to the feature level (a feature being a line, a geometric figure...), for classification or for other purposes.
At a very high level, clustering is a statistical tool; it helps discover the relative importance of various dimensions in defining whether a particular item belongs to a particular category.
One [of many] usage[s] of such a tool, is with supervised learning, whereby a set of human-selected items (say images) are fed into the cluster-based logic, along with a label associated with a particular item ("this is an apple", "this is another apple", "this is a lemon"...), the clustering logic then determines how much each dimension of the input matters for helping each group of items (apples, lemons...) fit in a distinct cluster (for example the color may matter relatively little, but the shape, or the presence of dots, or whatever may matter a lot). After this training phase, new images can be fed to the logic and by seeing how close to a particular cluster this image falls, it is "recognized" (as a banana!).
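A stripped-down version of that train-then-recognize loop is nearest-centroid classification, sketched below in Python. This is a simplification of what I described: it averages each labeled group into one cluster center and does not learn how much each dimension matters. The two features (roundness, elongation) and all the numbers are made up for illustration.

```python
import math

def centroid(points: list) -> tuple:
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def nearest_label(centroids: dict, x: tuple) -> str:
    """Label of the cluster center closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

# Hypothetical labeled training items as (roundness, elongation) features.
training = {
    "apple": [(0.9, 0.1), (0.8, 0.2)],
    "banana": [(0.2, 0.9), (0.1, 0.8)],
}
centroids = {label: centroid(points) for label, points in training.items()}

# A new, unlabeled item is "recognized" by the cluster it falls closest to.
prediction = nearest_label(centroids, (0.15, 0.95))
```

In a real system the feature vectors would come from earlier image processing stages and would have many more dimensions.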
When it comes to image processing, one needs to remember that whatever is "fed" to the clustering logic is not necessarily (in fact, rarely) the raw pixels, but various "objects" characterizing various "elements" of the original data (essentially a collection of relatively high-dimensional vectors, not unlike some that one may have encountered in other data clustering examples), produced by previous stages of the process. For example, an important element of facial recognition is probably the exact distance between the centers of the eyes. In previous stages, the image is processed in a way that figures out where the eyes are (possibly relying on another clustering-based logic). Then the distance between the eyes, along with many other elements, is fed to the final clustering logic.
The preceding description is only one example of the use of clustering for image recognition. Indeed, various forms of neural networks have been used, very successfully, in this domain, and it can be argued that in a sense these neural networks are clustering information. One of the reasons for the success of neural nets may lie in their ability to be more respectful of the locality dimension as found in the original input, and also their ability to work in a hierarchical fashion.
A good conclusion to this write up would be a short list of online resources, but I'm pressed for time at the moment... "to be continued" ;-)
Next day edit: (failed attempt to provide an introductory online bibliography on the subject)
My search for literature on the topic of clustering as applied to artificial vision and image processing revealed two distinct... clusters ;-)
• Books such as Algorithms for Image Processing and Computer Vision (J. Parker, pub. Wiley) or Machine Vision: Theory, Algorithms, Practicalities (M. Seul et al., Cambridge UP). Such books generally cover all the important techniques associated with noise reduction, edge detection, color or intensity conversion, and many other elements of the image processing chain, most of which do not involve clustering or even statistical methods, and they devote only a chapter or two, or even minor mentions, to clustering as applied to pattern recognition or to other tasks.
• Scholarly papers and conference handbooks, which specifically cover clustering techniques applied to artificial vision and such, but in the narrowest and deepest fashion (e.g., variations on the Fukunaga and Narendra algorithm for applications in character recognition, or fast methods for selecting nearest neighbor candidates in whatever context).
In short I feel ill equipped to make any specific book or article suggestion.
You may find it informative to browse titles in, say, Google Books, keying in "Artificial vision" or "Image Recognition" or some of the titles mentioned above. With the preview feature and also the tag cloud (btw, another application of clustering) found in the "about this book" link, one can get a good idea of the various books' contents and maybe decide to purchase some of them. Unfortunately, the reduced readership and the potentially lucrative applications in the field make these books relatively expensive. At the other end of the spectrum, you may download, sometimes for free, research papers discussing advanced topics in the field. These will also show up on regular (web) Google, or at specialized repositories such as CiteSeer.
Good luck with your exploration in that field!