Should LOC counting include tests and comments? - metrics

While LOC (# lines of code) is a problematic measurement of a code's complexity, it is the most popular one, and when used very carefully, can provide a rough estimate of at least relative complexities of code bases (i.e. if one program is 10KLOC and another is 100KLOC, written in the same language, by teams of roughly the same competence, the second program is almost certainly much more complex).
When counting lines of code, do you prefer to include comments? What about tests?
I've seen various approaches to this. Tools like cloc and sloccount let you either include or exclude comments. Other people consider comments part of the code and its complexity.
The same dilemma exists for unit tests, which can sometimes reach the size of the tested code itself, and even exceed it.
I've seen approaches all over the spectrum, from counting only "operational" non-comment non-blank lines, to "XXX lines of tested, commented code", which is more like running "wc -l on all code files in the project".
What is your personal preference, and why?

A wise man once told me 'you get what you measure' when it comes to managing programmers.
If you rate them on their LOC output, amazingly, you tend to get a lot of lines of code.
If you rate them on the number of bugs they close out, amazingly you get a lot of bugs fixed.
If you rate them on features added, you get a lot of features.
If you rate them on cyclomatic complexity you get ridiculously simple functions.
Since one of the major problems with code bases these days is how quickly they grow and how hard they are to change once they've grown, I tend to shy away from using LOC as a metric at all, because it drives the wrong fundamental behavior.
That said, if you have to use it, count sans comments and tests and require a consistent coding style.
But if you really want a measure of 'code size', just tar.gz the code base. The compressed size tends to serve as a better rough estimate of 'content' than counting lines, which is susceptible to differences in programming style.
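A minimal sketch of the tar.gz idea, using only Python's standard library. The function name `compressed_size` is made up for illustration, and the number is only meaningful when comparing code bases archived the same way.

```python
import io
import tarfile
from pathlib import Path


def compressed_size(root: str) -> int:
    """Return the size in bytes of an in-memory .tar.gz of the directory `root`."""
    buffer = io.BytesIO()
    # "w:gz" writes a gzip-compressed tar stream into the buffer.
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        archive.add(str(Path(root)), arcname=".")
    # The gzip stream is finalized when the archive is closed above.
    return len(buffer.getvalue())
```

Calling `compressed_size("path/to/project")` on two projects gives two numbers that can be compared, with the usual caveat that generated files, vendored libraries, and binary assets should be excluded first.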

Tests and comments have to be maintained too. If you're going to use LOC as a metric (and I'm just going to assume that I can't talk you out of it), you should give all three (lines of real code, comments, tests).
The most important (and hopefully obvious) thing is that you be consistent. Don't report one project with just the lines of real code and another with all three combined. Find or create a tool that will automate this process for you and generate a report.
Lines of Code: 75,000
Lines of Comments: 10,000
Lines of Tests: 15,000
Total: 100,000
This way you can be sure it will get done, and get done the same way every time.
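As a rough sketch of such a tool, the following classifies physical lines and produces a report shaped like the one above. It assumes single-line `#` comments and a `tests/` path prefix for test files; both are illustrative assumptions, and block comments or trailing comments are not handled.

```python
from collections import Counter


def count_lines(source, comment_prefix="#"):
    """Classify each physical line as blank, comment, or code."""
    counts = Counter(code=0, comment=0, blank=0)
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(comment_prefix):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts


def report(files, test_prefix="tests/"):
    """Summarize a {path: source} mapping into code, comment, and test lines."""
    totals = {"code": 0, "comments": 0, "tests": 0}
    for path, source in files.items():
        counts = count_lines(source)
        # Non-comment lines in test files are reported separately.
        bucket = "tests" if path.startswith(test_prefix) else "code"
        totals[bucket] += counts["code"]
        totals["comments"] += counts["comment"]
    totals["total"] = totals["code"] + totals["comments"] + totals["tests"]
    return totals
```

Running the same script on every project is what keeps the three numbers comparable between reports.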

I personally don't feel that the LOC metric on its own is as useful as some of the other code metrics.
NDepend will give you the LOC metric but will also give you many others, such as cyclomatic complexity. Rather than list them all, here's the link to the list.
There is also a free CodeMetric add-in for Reflector.

I'm not going to directly answer your question for a simple reason: I hate the lines of code metric. No matter what you're trying to measure, it's very hard to do worse than LOC; pretty much any other metric you care to think of is going to be better.
In particular, you seem to want to measure the complexity of your code. Overall cyclomatic complexity (also called McCabe's complexity) is a much better metric for this.
Routines with a high cyclomatic complexity are the routines you want to focus your attention on. It's these routines that are difficult to test, rotten to the core with bugs and hard to maintain.
There are many tools that measure this sort of complexity. A quick Google search on your favourite language will find dozens of tools that measure it.

Lines of Code means exactly that: no comments or empty lines are counted. And in order for it to be comparable to other source code (no matter whether the metric in itself is helpful or not), you need at least similar coding styles:
    for (int i = 0; i < list.count; i++)
    {
        // do some stuff
    }

    for (int i = 0; i < list.count; i++){
        // do some stuff
    }
The second version does exactly the same, but has one LOC less. When you have a lot of nested loops, this can sum up quite a bit. Which is why metrics like function points were invented.

Depends on what you are using the LOC for.
As a complexity measure - not so much. Maybe the 100KLOC are mostly code generated from a simple table, and the 10KLOC has 5KLOC of regexps.
However, I see every line of code associated with a running cost. You pay for every line as long as the program lives: it needs to be read when maintained, it might contain an error that needs to be fixed, it increases compile time, get-from-source-control and backup times, before you change or remove it you may need to find out if anyone relies on it etc. The average cost may be nanopennies per line and day, but it's stuff that adds up.
KLOC can be a first shot indicator of how much infrastructure a project needs. In that case, I would include comments and tests - even though the running cost of a comment line is much lower than one of the regexp's in the second project.
[edit] [someone with a similar opinion about code size]

We only use a lines of code metric for one thing - a function should contain few enough lines of code to be read without scrolling the screen. Functions bigger than that are usually hard to read, even if they have a very low cyclomatic complexity. For this use we do count whitespace and comments.
It can also be nice to see how many lines of code you've removed during a refactor - here you only want to count actual lines of code, whitespace that doesn't aid readability and comments that aren't useful (which can't be automated).
Finally a disclaimer - use metrics intelligently. A good use of metrics is to help answer the question 'which part of the code would benefit most from refactoring' or 'how urgent is a code review for the latest checkin?' - a 1000 line function with a cyclomatic complexity of 50 is a flashing neon sign saying 'refactor me now'. A bad use of metrics is 'how productive is programmer X' or 'How complicated is my software'.

Excerpt from the article: How do you count your number of Lines Of Code (LOC) ? relative to the tool NDepend that counts the logical numbers of lines of code for .NET programs.
How do you count your number of Lines Of Code (LOC) ?
Do you count method signature declaration? Do you count lines with only bracket? Do you count several lines when a single method call is written on several lines because of a high number of parameters? Do you count ‘namespaces’ and ‘using namespace’ declaration? Do you count interface and abstract methods declaration? Do you count fields assignment when they are declared? Do you count blank line?
Depending on the coding style of each developer and on the language chosen (C#, VB.NET…), there can be a significant difference when measuring the LOC.
Apparently measuring the LOC from parsing source files looks like a complex subject. Thanks to an astute there exists a simple way to measure exactly what is called the logical LOC. The logical LOC has 2 significant advantages over the physical LOC (the LOC that is inferred from parsing source files):
Coding style doesn’t interfere with logical LOC. For example the LOC won’t change because a method call is spread over several lines because of a high number of arguments.
Logical LOC is independent from the language. Values obtained from assemblies written with different languages are comparable and can be summed.
In the .NET world, the logical LOC can be computed from the PDB files, the files that are used by the debugger to link the IL code with the source code. The tool NDepend computes the logical LOC for a method this way: it is equal to the number of sequence points found for a method in the PDB file. A sequence point is used to mark a spot in the IL code that corresponds to a specific location in the original source. More info about sequence points here. Notice that sequence points which correspond to C# braces ‘{’ and ‘}’ are not taken into account.
Obviously, the LOC for a type is the sum of its methods’ LOC, the LOC for a namespace is the sum of its types’ LOC, the LOC for an assembly is the sum of its namespaces’ LOC and the LOC for an application is the sum of its assemblies’ LOC. Here are some observations:
Interfaces, abstract methods and enumerations have a LOC equal to 0. Only concrete code that is effectively executed is considered when computing LOC.
Namespaces, types, fields and methods declarations are not considered as line of code because they don’t have corresponding sequence points.
When the C# or VB.NET compiler faces an inline instance fields initialization, it generates a sequence point for each of the instance constructor (the same remark applies for inline static fields initialization and static constructor).
LOC computed from an anonymous method doesn’t interfere with the LOC of its outer declaring methods.
The overall ratio between NbILInstructions and LOC (in C# and VB.NET) is usually around 7.


Creating the N most accurate sparklines for M sets of data

I recently constructed a simple name popularity tool that allows users to select names and investigate their popularity over time and by state. This is just a fun project with no commercial or professional value, but it scratched a curiosity itch.
One improvement I would like to add is the display of simple sparklines beside each name in the select list, showing the normalized national popularity trends since 1910.
Doing an image request for every single name -- where hypothetically I've preconstructed the spark lines for every possible variant -- would slow the interface too much and yield a lot of unnecessary traffic as users quickly scroll and filter past hundreds of names they aren't interested in. Building sprites with sparklines for sets of names is a possibility, but again with tens of thousands of names, in the end the user's cache would be burdened with a lot of unnecessary information.
My goal is absolutely tuned minimalism.
Which got me contemplating the interesting challenge of taking M sets of data (occurrences over time) and distilling that to the most proximal N representative sparklines. For this purpose they don't have to be exact, but should be a general match, and where I could tune N to yield a certain accuracy number.
Essentially a form of sparkline lossy compression.
I feel like this most certainly is a solved problem, but can't find or resolve the heuristics that would yield the algorithms that would shorten the path.
What you describe seems to be cluster analysis - e.g. shoving that into Wikipedia will give you a starting point. Particular methods for cluster analysis include k-means and single linkage. A related topic is Latent Class Analysis.
If you do this, another option is to look at the clusters that come out, give them descriptive names, and then display the cluster names rather than inaccurate sparklines - or I guess you could draw not just a single line in the sparkline, but two or more lines showing the range of popularities seen within that cluster.
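A minimal k-means sketch for this use case, in plain Python. Each name's yearly counts are first normalized so that only the shape of the trend matters; the N centroids then serve as the representative sparklines, and each name stores only the index of its centroid. The function names and the random initialization are illustrative assumptions; a library implementation would be a better choice for tens of thousands of names.

```python
import random


def normalize(series):
    """Scale a series so its peak is 1.0 (shape matters, not magnitude)."""
    peak = max(series)
    return [v / peak for v in series] if peak else list(series)


def distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(series_list, k, iterations=50, seed=0):
    """Return (centroids, assignment) for k clusters of equal-length series."""
    rng = random.Random(seed)
    centroids = [list(s) for s in rng.sample(series_list, k)]
    assignment = []
    for _ in range(iterations):
        # Assign every series to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(s, centroids[c]))
            for s in series_list
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, a in zip(series_list, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

Tuning N then amounts to rerunning with different `k` and watching the total distance between each series and its centroid, which is the "accuracy number" mentioned in the question.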

My Algorithm only fails for large values - How do I debug this?

I'm working on transcribing as3delaunay to Objective-C. For the most part, the entire algorithm works and creates graphs exactly as they should be. However, for large values (thousands of points), the algorithm mostly works, but creates some incorrect graphs.
I've been going back through and checking the most obvious places for error, and I haven't been able to actually find anything. For smaller values I ran the output of the original algorithm and placed it into JSON files. I then read that output in to my own tests (tests with 3 or 4 points only), and debugged until the output matched; I checked the output of the two algorithms line for line, and found the discrepancies. But I can't feasibly do that for 1000 points.
Answers don't need to be specific to my situation (although suggesting tools I can use would be excellent).
How can I debug algorithms that only fail for large values?
If you are transcribing an existing algorithm to Objective-C, do you have a working original in some other language? In that case, I would be inclined to put in print statements in both versions and debug the first discrepancy (the first, because later discrepancies could be knock-on errors).
I think it is very likely that the program also makes mistakes for smaller graphs, but more rarely. My first step would in fact be to use the working original (or some other means) to run a large number of automatically checked test runs on small graphs, hoping to find the bug on some more manageable input size.
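The automated small-input checking could look roughly like the sketch below. The working original acts as the oracle, inputs are generated randomly from the smallest size upward, and the first disagreement is returned. `find_smallest_mismatch`, `make_input`, and the parameters are hypothetical names for illustration; for the Delaunay port, `reference` would wrap the recorded output of the original and `candidate` the Objective-C version.

```python
import random


def find_smallest_mismatch(reference, candidate, make_input,
                           sizes=range(1, 9), trials=200, seed=0):
    """Return the first generated input on which the two functions disagree.

    Sizes are tried in increasing order, so a returned input is among the
    smallest sizes at which a discrepancy was observed. Returns None if
    every trial agreed.
    """
    rng = random.Random(seed)
    for size in sizes:
        for _ in range(trials):
            data = make_input(rng, size)
            if reference(data) != candidate(data):
                return data
    return None
```

A fixed seed keeps the failing input reproducible, so it can be saved as a permanent regression test once the bug is understood.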
Find the threshold
If it works for 3 or 4 items, but not for 1000, then there's probably some threshold in between. Use a binary search to find that threshold.
The threshold itself may be a clue. For example, maybe it corresponds to a magic value in the algorithm or to some other value you wouldn't expect to be correlated. For example, perhaps it's a problem when the number of items exceeds the number of pixels in the x direction of the chart you're trying to draw. The clue might be enough to help you solve the problem. If not, it may give you a clue as to how to force the problem to happen with a smaller value (e.g., debug it with a very narrow chart area).
The threshold may be smaller than you think, and may be directly debuggable.
If the threshold is a big value, like 1000, perhaps you can set a conditional breakpoint to skip right to iteration 999, and then single-step from there.
There may not be a definite threshold, which suggests that it's not the magnitude of the input size, but some other property you should be looking at (e.g., powers of 10 don't work, but everything else does).
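The binary search for the threshold can be sketched as follows. It assumes you can wrap a run in a predicate `fails(n)` (a hypothetical name) and that failure is monotonic in `n`: it passes at `low`, fails at `high`, and never goes back to passing in between. If that assumption does not hold, the result is just some failing size next to a passing one, which is the "no definite threshold" case described above.

```python
def find_threshold(fails, low, high):
    """Return the smallest n in (low, high] for which fails(n) is true.

    Assumes fails(low) is False, fails(high) is True, and that failures
    are monotonic in n.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if fails(mid):
            high = mid  # threshold is at mid or below
        else:
            low = mid   # threshold is above mid
    return high
```

Going from 4 to 1000 points this way takes about eight runs instead of hundreds.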
Decompose the problem and write unit tests
This can be tedious but is often extremely valuable--not just for the current issue, but for the future. Convince yourself that each individual piece works in isolation.
Re-visit recent changes
If it used to work and now it doesn't, look at the most recent changes first. Source control tools are very useful in helping you remember what has changed recently.
Remove code and add it back piece by piece
Comment out as much code as you can and still get some kind of reasonable output (even if that output doesn't meet all the requirements). For example, instead of using a complicated rounding function, just truncate values. Comment out code that adds decorative touches. Put assert(false) in any special case handlers you don't think should be activated for the test data.
Now verify that output, and slowly add back the functionality you removed, one baby step at a time. Test thoroughly at each step.
Profile the code
Profiling is usually for optimization, but it can sometimes give you insight into code, especially when the data size is too large for single-stepping through the debugger. I like to use line or statement counts. Is the loop body executing the number of times you expect? Or twice as often? Or not at all? How about the then and else clauses of those if statements? Logic bugs often become very obvious with this type of profiling.
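For a quick line-count profile without extra tooling, Python's `sys.settrace` hook can count how often each line of one function runs. This is a sketch rather than a real profiler: `count_line_executions` is a made-up helper, it only traces the function passed in (not its callees), and it reports line offsets relative to the `def` line.

```python
import sys
from collections import Counter


def count_line_executions(func, *args, **kwargs):
    """Call func and return (result, Counter of executions per line offset)."""
    counts = Counter()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace lines inside other functions
        if event == "line":
            counts[frame.f_lineno - code.co_firstlineno] += 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the previous hook
    return result, counts
```

If a loop body that should run once per point shows up with twice that count, or with zero, that is usually the place to start reading.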

Preventing generation of swastika-like images when generating identicons

I am using this PHP script to generate identicons. It uses Don Park's original identicon algorithm.
The script works great and I have adapted it to my own application to generate identicons. The problem is that sometimes swastikas are generated. While swastikas have peaceful origins, people do take offence when seeing those symbols.
What I would like to do is to alter the algorithm so that swastikas are never generated. I have done a bit of digging and found this thread on Microsoft's website where an employee states that they have added a tweak to prevent generation of swastikas, but nothing more.
Has anyone identified what the tweak would be and how to prevent swastikas from being generated?
Identicons appear to me (on a quick glance) always to have four-fold rotational symmetry. Swastikas certainly do. How about just repeating the quarter-block in a different way? If you take a quarter-block that would produce a swastika in the current pattern, and reflect two diagonally-opposite quarters, then you get a sort of space invader.
Basically, nothing with reflectional symmetry can look very much like a swastika. I suppose if there's a small swastika entirely contained within the quarter, then you still have a problem.
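The difference between the two layouts can be checked on small grids. The sketch below builds a full tile from one quarter-block in two ways: a pinwheel (each quarter rotated a further 90 degrees, which gives the four-fold rotational symmetry) and a mirrored layout (one possible way to introduce reflectional symmetry, not necessarily the exact arrangement described above). The grid representation and function names are assumptions for illustration, not part of any identicon library.

```python
def rotate_cw(block):
    """Rotate a square grid (list of lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*block[::-1])]


def flip_h(block):
    """Mirror a grid left to right."""
    return [row[::-1] for row in block]


def flip_v(block):
    """Mirror a grid top to bottom."""
    return block[::-1]


def stitch(tl, tr, bl, br):
    """Join four equal quarters into one grid."""
    top = [a + b for a, b in zip(tl, tr)]
    bottom = [a + b for a, b in zip(bl, br)]
    return top + bottom


def pinwheel(q):
    """Rotational layout: can produce swastika-like shapes."""
    r1 = rotate_cw(q)
    r2 = rotate_cw(r1)
    r3 = rotate_cw(r2)
    return stitch(q, r1, r3, r2)


def mirrored(q):
    """Reflective layout: always left-right symmetric."""
    return stitch(q, flip_h(q), flip_v(q), flip_v(flip_h(q)))
```

For any quarter `q`, `mirrored(q)` equals its own `flip_h`, whereas `pinwheel(q)` is unchanged by `rotate_cw` and generally differs from its mirror image, which is exactly the property a swastika-like shape needs.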
On Jeff Atwood's introducing thread, Don Park suggested:
Re Swastika comments, that can be addressed by applying a specialized OCR-like visual analysis to identify all offending codes then crunch them into an effective bloom filter using genetic algorithm. When the filter returns true, a second type of identicon (i.e. 4-block quilt) can be used.
Alternatively, you could avoid the issue entirely by replacing identicons with unicorns.
My original suggestion involving visual analysis was in context of the particular algorithm in use, namely 9-block quilt.
If you want to try another algorithm without the swastika problem, try introducing symmetry like that seen in inkblots to the popular 16-block quilt identicons.

Method for runtime comparison of two programs' objects

I am working through a particular type of code testing that is rather nettlesome and could be automated, yet I'm not sure of the best practices. Before describing the problem, I want to make clear that I'm looking for the appropriate terminology and concepts, so that I can read more about how to implement it. Suggestions on best practices are welcome, certainly, but my goal is specific: what is this kind of approach called?
In the simplest case, I have two programs that take in a bunch of data, produce a variety of intermediate objects, and then return a final result. When tested end-to-end, the final results differ, hence the need to find out where the differences occur. Unfortunately, even intermediate results may differ, but not always in a significant way (i.e. some discrepancies are tolerable). The final wrinkle is that intermediate objects may not necessarily have the same names between the two programs, and the two sets of intermediate objects may not fully overlap (e.g. one program may have more intermediate objects than the other). Thus, I can't assume there is a one-to-one relationship between the objects created in the two programs.
The approach that I'm thinking of taking to automate this comparison of objects is as follows (it's roughly inspired by frequency counts in text corpora):
1. For each program, A and B: create a list of the objects created throughout execution, which may be indexed in a very simple manner, such as a001, a002, a003, a004, ... and similarly for B (b001, ...).
2. Let Na = # of unique object names encountered in A, similarly for Nb and # of objects in B.
3. Create two tables, TableA and TableB, with Na and Nb columns, respectively. Entries will record a value for each object at each trigger (i.e. for each row, defined next).
4. For each assignment in A, the simplest approach is to capture the hash value of all of the Na items; of course, one can use LOCF (last observation carried forward) for those items that don't change, and any as-yet unobserved objects are simply given a NULL entry. Repeat this for B.
5. Match entries in TableA and TableB via their hash values. Ideally, objects will arrive into the "vocabulary" in approximately the same order, so that order and hash value will allow one to identify the sequences of values.
6. Find discrepancies in the objects between A and B based on when the sequences of hash values diverge for any objects with divergent sequences.
Now, this is a simple approach and could work wonderfully if the data were simple, atomic, and not susceptible to numerical precision issues. However, I believe that numerical precision may cause hash values to diverge, though the impact is insignificant if the discrepancies are approximately at the machine tolerance level.
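One way to blunt the precision problem is to round floating-point values to a fixed number of significant digits before hashing, so that differences near machine tolerance produce the same digest. The sketch below does that for floats, other scalars, and nested lists or tuples; `tolerant_digest` and `first_divergence` are hypothetical helpers, and the approach is imperfect because two nearly equal values that straddle a rounding boundary will still hash differently.

```python
import hashlib
import math


def tolerant_digest(value, significant_digits=9):
    """Hash a value after rounding floats to a fixed number of significant digits."""
    def canonical(v):
        if isinstance(v, float):
            if v == 0 or not math.isfinite(v):
                return repr(v)
            # Scientific notation keeps the same relative precision at any scale.
            return f"{v:.{significant_digits - 1}e}"
        if isinstance(v, (list, tuple)):
            return "[" + ",".join(canonical(x) for x in v) + "]"
        return repr(v)

    return hashlib.sha256(canonical(value).encode()).hexdigest()[:12]


def first_divergence(values_a, values_b):
    """Return the first step at which two value sequences disagree, or None."""
    for step, (a, b) in enumerate(zip(values_a, values_b)):
        if tolerant_digest(a) != tolerant_digest(b):
            return step
    return None
```

Storing the rounded digest in each table cell, instead of a hash of the raw bytes, keeps the matching step unchanged while tolerating small numerical noise.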
First: What is a name for such types of testing methods and concepts? An answer need not necessarily be the method above, but reflects the class of methods for comparing objects from two (or more) different programs.
Second: What standard methods exist for what I describe in steps 3 and 4? For instance, the "value" need not only be a hash: one might also store the sizes of the objects - after all, two objects cannot be the same if they are massively different in size.
In practice, I tend to compare a small number of items, but I suspect that when automated this need not involve a lot of input from the user.
Edit 1: This paper is related in terms of comparing the execution traces; it mentions "code comparison", which is related to my interest, though I'm more concerned with the data (i.e. objects) than with the actual code that produces the objects. I've just skimmed it, but will review it more carefully for methodology. More importantly, this suggests that comparing code traces may be extended to comparing data traces. This paper analyzes some comparisons of code traces, albeit in a wholly unrelated area of security testing.
Perhaps data-tracing and stack-trace methods are related. Checkpointing is slightly related, but its typical use (i.e. saving all of the state) is overkill.
Edit 2: Other related concepts include differential program analysis and monitoring of remote systems (e.g. space probes) where one attempts to reproduce the calculations using a local implementation, usually a clone (think of a HAL-9000 compared to its earth-bound clones). I've looked down the routes of unit testing, reverse engineering, various kinds of forensics, and whatnot. In the development phase, one could ensure agreement with unit tests, but this doesn't seem to be useful for instrumented analyses. For reverse engineering, the goal can be code & data agreement, but methods for assessing fidelity of re-engineered code don't seem particularly easy to find. Forensics on a per-program basis are very easily found, but comparisons between programs don't seem to be that common.
(Making this answer community wiki, because dataflow programming and reactive programming are not my areas of expertise.)
The area of data flow programming appears to be related, and thus debugging of data flow programs may be helpful. This paper from 1981 gives several useful high level ideas. Although it's hard to translate these to immediately applicable code, it does suggest a method I'd overlooked: when approaching a program as a dataflow, one can either statically or dynamically identify where changes in input values cause changes in other values in the intermediate processing or in the output (not just changes in execution, if one were to examine control flow).
Although dataflow programming is often related to parallel or distributed computing, it seems to dovetail with Reactive Programming, which is how the monitoring of objects (e.g. the hashing) can be implemented.
This answer is far from adequate, hence the CW tag, as it doesn't really name the debugging method that I described. Perhaps this is a form of debugging for the reactive programming paradigm.
[Also note: although this answer is CW, if anyone has a far better answer in relation to dataflow or reactive programming, please feel free to post a separate answer and I will remove this one.]
Note 1: Henrik Nilsson and Peter Fritzson have a number of papers on debugging for lazy functional languages, which are somewhat related: the debugging goal is to assess values, not the execution of code. This paper seems to have several good ideas, and their work partially inspired this paper on a debugger for a reactive programming language called Lustre. These references don't answer the original question, but may be of interest to anyone facing this same challenge, albeit in a different programming context.

Bug distribution

I have a program that I'm porting from one language to another. I'm doing this with a translation program that I'm developing myself. The relevant result of this is that I expect that there are a number of bugs in my system that I am going to need to find and fix. Each bug is likely to manifest in many places and fixing it will fix the bug in all the places it shows up. (I feel like I have a really big lever and I'm pushing on the short end: I push really hard, but when things move they move a lot.)
I have the ability to run execution log diffs so I'm measuring my progress by how far through the test suite I can run it before it diverges from the original program's execution. (Thank [whatever you want] for BeyondCompare, it works reasonably well with ~1M line files :D)
The question is: What shape should I expect to see if I were to plot that run length as a function of time? (more time == more bugs removed)
My first thought is something like a Poisson distribution. However because fixing each bug also removes all other occurrences of it, that shouldn't be quite correct.
(BTW this might have real world implications with regards to estimating when programs will finish being debugged.)
A more abstract statement of the problem:
Given an ordered list of N integers selected from the range [0,M] (where N>>M), with a uniform distribution along the positions in the list but not necessarily a uniform distribution of numbers: what is the expected location of the last “new” number? What about the second to last? And so on?
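The abstract version is easy to explore by simulation. The sketch below draws N values with given (possibly non-uniform) weights and records the positions at which each distinct value first appears; averaging the last such position over many runs estimates the expected location of the last "new" number. The function names and the weight vectors are illustrative assumptions, not taken from the real test suite.

```python
import random


def first_occurrence_positions(n, weights, rng):
    """Draw n values with the given weights; return positions of first sightings."""
    values = rng.choices(range(len(weights)), weights=weights, k=n)
    seen = set()
    positions = []
    for position, value in enumerate(values):
        if value not in seen:
            seen.add(value)
            positions.append(position)
    return positions


def mean_last_new_position(n, weights, runs=200, seed=0):
    """Estimate the expected position of the last first-time value."""
    rng = random.Random(seed)
    total = sum(first_occurrence_positions(n, weights, rng)[-1]
                for _ in range(runs))
    return total / runs
```

With equal weights the last new value shows up early; with one rare value it shows up far later, which matches the intuition that the run length makes a few big jumps and then creeps forward as only the rare bugs remain.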
Engineers are always trained to look for an exponential curve:
$$\mathrm{bugs}(t) = \left\lceil c_1 e^{-c_2 t} \right\rceil + R(t)$$
Where $c_1$ and $c_2$ are constants that depend on the number of test cases and your coding skills. $R(t)$ is a random function whose amplitude and distribution depend on the Butterfly Effect, the amount of sleep you had last night, and the proximity of your deadline and manager.
Before you start coding, at t=0, all of your test cases will fail, yielding at least c1 bugs.
As you code and t increases, the exponential term decays asymptotically toward zero, so its ceiling (and with it the calculated number of bugs) eventually settles at 1. This is because we all know "there's always one more bug."
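The deterministic part of that formula is simple enough to tabulate. The sketch below leaves out the random term R(t) and uses made-up values for c1 and c2; it is only meant to show the curve's shape.

```python
import math


def expected_bugs(t, c1, c2):
    """Deterministic part of the model: ceil(c1 * e^(-c2 * t)), without R(t)."""
    return math.ceil(c1 * math.exp(-c2 * t))
```

With c1 = 50 and c2 = 0.1, the value starts at 50 for t = 0 and is already down to 1 by t = 100, where it stays for as long as the exponential is still representable as a positive floating-point number.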
In general, the number of new bugs found as a function of time should follow a Poisson-like distribution. Assuming bugs are fixed essentially as they are found, then the number of open bugs should follow the same distribution.
I actually used this early in my career to "prove" to my business unit that a particular feature set wasn't ready for release. I graphed new and open bugs as a function of time for the current project, and also for the previous two versions. The two older data sets showed an initial steep ramp up, a peak, and a gradual decline until their release dates. The current data showed a linear increase that continued to the day I created the graph.
We were given several more days of testing, and the testers were given training on how to test the product more effectively. Thanks to both decisions, the release was relatively defect-free.
