How does the PPMD decompression algorithm work? - algorithm

So we have a ppmd decompression code which was cut and pasted from Dmitry Shkarin's original code from 1997 (based on the few comments in it). The code itself is mostly uncommented, and I just can't find out how it works.
The code uses a suballocator, but it doesn't just allocates or deallocates from it, instead it manipulates the free block list directly from the calling code in various ways I can't decipher yet.
We have found a fuzzed sample that causes the code to crash, I was assigned to fix it.
But in order to tackle the problem I need to understand how does the decompression works (only interested in decompression).
Google wasn't very helpful either. Search results are dominated by the results of a gamer with identical nickname, or feature list of various archivers. I eventually found a Russian website where the algorithm specification can be found - only through the Wayback machine and in Russian - which I cannot read due to language barrier.
But it looks like it's only a mathematical description. So far I found nothing about the specification on how does a PPMD compressed data is laid out in compressed file or how it is consumed when decompressing.
Can anyone who understands the PPMD algorithm give me some pointers?
Ideally I'm looking for documents that explains the structure of PPMd encoded data. Something as detailed as the RFC1951 for the deflate.
UPDATE:
Well it turns out the code has quite a few fishy things.
For example this one:
MaxContext=FoundState->Successor; return;
}
*pText++ = FSymbol; Successor = (PPM_CONTEXT*) pText;
if (pText >= UnitsStart) goto RESTART_MODEL;
if ( FSuccessor ) {
if ((BYTE*) FSuccessor < UnitsStart)
It writes stuff into a byte buffer, then casts it into a struct that contains pointers.
Then in the CreateSuccessors functions we have another sorcery.
ct.oneState().Successor=(PPM_CONTEXT*) (((BYTE*) UpBranch)+1);
The UpBranch and ct.oneState().Successor are PPM_CONTEXT pointers. I can't imagine what would be the purpose of a statement like this. As I said this structure contains pointers which can be dereferenced eventually (I tried to set these pointers to NULL to see whether they are used). And it turns out they are indeed dereferenced! (at least in the second case).

Related

Is it possible to use CompUnit modules for collected data?

I have developed a module for processing a collection of documents.
One run of the software collects information about them. The data is stored in two structures called %processed and %symbols. The data needs to be cached for subsequent runs of the software on the same set of documents, some of which can change. (The documents are themselves cached using CompUnit modules).
Currently the data structures are stored / restored as follows:
# storing
'processed.raku`.IO.spurt: %processed.raku;
'symbols.raku`.IO.spurt: %symbol.raku;
# restoring
my %processed = EVALFILE 'processed.raku';
my %symbols = EVALFILE 'symbols.raku';
Outputting these structures into files, which can be quite large, can be slow because the hashes are parsed to create the Stringified forms, and slow on input because they are being recompiled.
It is not intended for the cached files to be inspected, only to save state between software runs.
In addition, although this is not a problem for my use case, this technique cannot be used in general because Stringification (serialisation) does not work for Raku closures - as far as I know.
I was wondering whether the CompUnit modules could be used because they are used to store compiled versions of modules. So perhaps, they could be used to store a 'compiled' or 'internal' version of the data structures?
Is there already a way to do this?
If there isn't, is there any technical reason it might NOT be possible?
(There's a good chance that you've already tried this and/or it isn't a good fit for your usecase, but I thought I'd mention it just in case it's helpful either to you or to anyone else who finds this question.)
Have you considered serializing the data to/from JSON with JSON::Fast? It has been optimized for (de)serialization speed in a way that basic stringification hasn't been / can't be. That doesn't allow for storing Blocks or other closures – as you mentioned, Raku doesn't currently have a good way to serialize them. But, since you mentioned that isn't an issue, it's possible that JSON would fit your usecase.
[EDIT: as you pointed out below, this can make support for some Raku datastructures more difficult. There are typically (but not always) ways to work around the issue by specifying the datatype as part of the serialization step:
use JSON::Fast;
my $a = <a a a b>.BagHash;
my $json = $a.&to-json;
my BagHash() $b = from-json($json);
say $a eqv $b # OUTPUT: «True»
This gets more complicated for datastructures that are harder to represent in JSON (such as those with non-string keys). The JSON::Class module could also be helpful, but I haven't tested its speed.]
After looking at other answers and looking at the code for Precompilation, I realised my original question was based on a misconception.
The Rakudo compiler generates an intermediate "byte code", which is then used at run time. Since modules are self-contained units for compilation purposes, they can be precompiled. This intermediate result can be cached, thus significantly speeding up Raku programs.
When a Raku program uses code that has already been compiled, the compiler does not compile it again.
I had thought of the precompilation cache as a sort of storage of the internal state of a program, but it is not quite that. That is why - I think - #ralph was confused by the question, because I was not asking the right sort of question.
My question is about the storage (and restoration) of data. JSON::Fast, as discussed by #codesections is very fast because it is used by the Rakudo compiler at a low level and so is highly optimised. Consequently, restructuring data upon restoration will be faster than restoring native data types because the slow rate-determining step is storage and restoration from "disk", which JSON does very quickly.
Interestingly, the CompUnit modules I mentioned use low level JSON functions that make JSON::Fast so quick.
I am now considering other ways of storing data using optimised routines, perhaps using a compression/archiving module. It will come down to testing which is fastest. It may be that the JSON route is the fastest.
So this question does not have a clear answer because the question itself is "incorrect".
Update As #RichardHainsworth notes, I was confused by their question, though felt it should be helpful to answer as I did. Based on his reaction, and his decision not to accept #codesection's answer, which at that point was the only other answer, I concluded it was best to delete this answer to encourage others to answer. But now Richard has provided an answer that provides good resolution, I'm undeleting it in the hope that's now more useful.
TL;DR Instead of using EVALFILE, store your data in a module which you then use. There are simple ways to do this that would be minimal but useful improvements over EVALFILE. There are more complex ways that might be better.
A small improvement over EVALFILE
I've decided to first present a small improvement so you can solidify your shift in thinking from EVALFILE. It's small in two respects:
It should only take a few minutes to implement.
It only gives you a small improvement over EVALFILE.
I recommend you properly consider the rest of this answer (which describes more complex improvements with potentially bigger payoffs instead of this small one) before bothering to actually implement what I describe in this first section. I think this small improvement is likely to turn out to be redundant beyond serving as a mental bridge to later sections.
Write a program, say store.raku, that creates a module, say data.rakumod:
use lib '.';
my %hash-to-store = :a, :b;
my $hash-as-raku-code = %hash-to-store .raku;
my $raku-code-to-store = "unit module data; our %hash = $hash-as-raku-code";
spurt 'data.rakumod', $raku-code-to-store;
(My use of .raku is of course overly simplistic. The above is just a proof of concept.)
This form of writing your data will have essentially the same performance as your current solution, so there's no gain in that respect.
Next, write another program, say, using.raku, that uses it:
use lib '.';
use data;
say %data::hash; # {a => True, b => True}
useing the module will entail compiling it. So the first time you use this approach for reading your data instead of EVALFILE it'll be no faster, just as it was no faster to write it. But it should be much faster for subsequent reads. (Until you next change the data and have to rebuild the data module.)
This section also doesn't deal with closure stringification, and means you're still doing a data writing stage that may not be necessary.
Stringifying closures; a hack
One can extend the approach of the previous section to include stringifications of closures.
You "just" need to access the source code containing the closures; use a regex/parse to extract the closures; and then write the matches to the data module. Easy! ;)
For now I'll skip filling in details, partly because I again think this is just a mental bridge and suggest you read on rather than try to do as I just described.
Using CompUnits
Now we arrive at:
I was wondering whether the CompUnit modules could be used because they are used to store compiled versions of modules. So perhaps, they could be used to store a 'compiled' or 'internal' version of the data structures?
I'm a bit confused by what you're asking here for two reasons. First, I think you mean the documents ("The documents are themselves cached using CompUnit modules"), and that documents are stored as modules. Second, if you do mean the documents are stored as modules, then why wouldn't you be able to store the data you want stored in them? Are you concerned about hiding the data?
Anyhow, I will presume that you are asking about storing the data in the document modules, and that you're interested in ways to "hide" that data.
One simple option would be to write the data as I did in the first section, but insert the our %hash = $hash-as-raku-code"; etc code at the end, after the actual document, rather than at the start.
But perhaps that's too ugly / not "hidden" enough?
Another option might be to add Pod blocks with Pod block configuration data at the end of your document modules.
For example, putting all the code into a document module and throwing in a say just as a proof-of-concept:
# ...
# Document goes here
# ...
# At end of document:
=begin data :array<foo bar> :hash{k1=>v1, k2=>v2} :baz :qux(3.14)
=end data
say $=pod[0].config<array>; # foo bar
That said, that's just code being executed within the module; I don't know if the compiled form of the module retains the config data. Also, you need to use a "Pod loader" (cf Access pod from another Raku file). But my guess is you know all about such things.
Again, this might not be hidden enough, and there are constraints:
The data can only be literal scalars of type Str, Int, Num, or Bool, or aggregations of them in Arrays or Hashs.
Data can't have actual newlines in it. (You could presumably have double quoted strings with \ns in them.)
Modifying Rakudo
Aiui, presuming RakuAST lands, it'll be relatively easy to write Rakudo plugins that can do arbitrary work with a Raku module. And it seems like a short hop from RakuAST macros to basic is parsed macros which in turn seem like a short hop from extracting source code (eg the source of closures) as it goes thru the compiler and then spitting it back out into the compiled code as data, possibly attached to Pod declarator blocks that are in turn attached to code as code.
So, perhaps just wait a year or two to see if RakuAST lands and gets the hooks you need to do what you need to do via Rakudo?

list_to_atom Failed Caused by System Limitation in Erlang

I tried to compile a asn file with Erlang's asn1ct:compile function. I run the following code:
asn1ct:compile("PDU-definitions", [per, verbose]).
then got the following errors:
...
{error,{system_limit,[{erlang,list_to_atom,
["enc_InterRATHandoverInfo_v390NonCriticalExtensions_present_v3a0NonCriticalExtensions_laterNonCriticalExtensions_v3g0NonCriticalExtensions_v4b0NonCriticalExtensions_v4d0NonCriticalExtensions_v590NonCriticalExtensions_v690NonCriticalExtensions_nonCriticalExtensions"], []},
...
I googled and found there's a 255-character length limitation of Erlang's atom. Because there are too many nested data structure in the ASN file, the length of corresponding atom exceed the limitation.
My question is: if I can modify the default length limitation to a bigger value, or there are some workarounds for this situation?
Thanks!
As of R17, there remains no way to modify the maximum character limit in Erlang outside of modifying the source and recompiling. A sparse look at asn1ct's documentation suggests no means of changing the atom-encoding behaviour, either.
Best bet that I saw was the "n2n" option of compiling, which instructs the compiler to generate functions for doing name-to-enumeration conversion. I assume that it will still construct atoms in this case, however, which would be a moot point.
Nothing else in the documentation suggests a way to change name-construction behaviour, and as such severely nested data structures will cause problems.

Should a method parameter name specify its unit in its name?

Of the following two options for method parameter names that have a unit as well as a value, which do you prefer and why? (I've used Java syntax, but my question would apply to most languages.)
public void move(int length)
or
public void move(int lengthInMetres)
Option (1) would seem to be sufficient, but I find that when I'm coding/typing, my IDE can indicate to me I need a length value, but I typically have to break stride and look up the method's doco to determine the units, so that I pass in the correct value (and not kilometres instead of metres for example). This can be an annoying interruption to a thought process. Option (2) alleviates this problem, but can be verbose, particularly if your unit is metresPerSecondSquared or some such. Which do you think is the best?
I would recommend making your parameter (and method) names as clear as possible, even if they become wordy. You'll be glad when you look at or use the code in 6 months time, or when someone else has to look at your code.
If you think the names are becoming too long, consider rewording them. In your example you could use parameter name int Metres that would probably be clear enough. Consider changing the method name, eg public void moveMetres(int length).
In Visual Studio, the XML comments generated when you enter 3 comment symbols above a method definition will appear in Intellisense hints when you use the method in other locations. Other IDEs may have similar functionality.
Abbreviations should be used sparingly. If absolutely necessary only use commonly known and/or relevant industry-standard abbreviations and be consistent, ie use the same abbreviation everywhere.
Take a step back. Write the code then move on to something else. Come back the next day and check to see if the names are still clear.
Peer reviews can help too. Ask someone who knows the programming language (or just thinks logically), but not the specific functionality, if your naming scheme is clear enough or to help brainstorm alternatives. They might be the poor sap who has to maintain your code in the future!
I would prefer the second approach (i.e. lengthInMeters) as it describes the input needed for the method accurately. The fact that you find it confusing to figure out the units when you are just writing the code would imply it would be much more confusing when you (or some one) looks at the same piece of code later. As regard to issue of the variable name being longer you can find ways to abbreviate it (say "mtrsPerSecondSquared").
Also in defence second approach, the book Code Complete mentions a research that indicates, effort required to debug a program was minimized when variables had names averaged to 10 to 16 characters.

What is the most obfuscated code you've had to fix?

Most programmers will have had the experience of debugging/fixing someone else's code. Sometimes that "someone else's code" is so obfuscated it's bad enough trying to understand what it's doing.
What's the worst (most obfuscated) code you've had to debug/fix?
If you didn't throw it away and recode it from scratch, well why didn't you?
PHP OSCommerce is enough to say, it is obfuscated code...
a Java class
only static methods that manipulates DOM
8000 LOCs
long chain of methods that return null on "error": a.b().c().d().e()
very long methods (400/500 LOC each)
nested if, while, like:
if (...) {
for (...) {
if (...) {
if (...) {
while (...) {
if (...) {
cut-and-paste oriented programming
no exceptions, all exceptions are catched and "handled" using printStackTrace()
no unit tests
no documentation
I was tempted to throw away and recode... but, after 3 days of hard debugging,
I've added the magic if :-)
Spaghetti code PHP CMS system.
by default, programmers think someone else's code is obfuscated.
The worse I probably had to do was interpreting what variables i1, i2 j, k, t were in a simple method and they were not counters in 'for' loops.
In all other circumstances I guess the problem area was difficult which made the code look difficult.
I found this line in our codebase today and thought it was a nice example of sneaky obfuscation:
if (MULTICLICK_ENABLED.equals(propService.getProperty(PropertyNames.MULTICLICK_ENABLED))) {} else {
return false;
}
Just making sure I read the whole line. NO SKIMREADING.
When working on a GWT project, I would reach parts of GWT-compiled obfuscated JS code which wasn't mine.
Now good luck debugging real obfuscated code.
I can't remember the full code, but a single part of it remains burned into my memory as something I spend hours trying to understand:
do{
$tmp = shift unless shift;
$tmp;
}while($tmp);
I couldn't understand it at first, it looks so useless, then I printed out #_ for a list of arguments, a series of alternating boolean and function names, the code was used in conjunction with a library detection module that changed behaviour if a function was broken, but the code was so badly documented and made of things like that which made no sense without a complete understanding of the full code I gave up and rewrote the whole thing.
UPDATE from DVK:
And, lest someone claims this was because Perl is unreadable as opposed to coder being a golf master instead of good software developer, here's the same code in a slightly less obfuscated form (the really correct code wouldn't even HAVE alternating sub names and booleans in the first place :)
# This subroutine take a list of alternating true/false flags
# and subroutine names; and executes the named subroutines for which flag is true.
# I am also weird, otherwise I'd have simply have passed list of subroutines to execute :)
my #flags_and_sub_names_list = #_;
while ( #flags_and_sub_names_list ) {
my $flag = shift #flags_and_sub_names_list;
my $subName = shift #flags_and_sub_names_list;
next unless $flag && $subName;
&{ $subName }; # Call the named subroutine
}
I've had a case of a 300lines function performing input sanitization which missed a certain corner case. It was parsing certain situations manually using IndexOf and Substring plus a lot of inlined variables and constants (looks like the original coder didn't know anything about good practices), and no comment was provided. Throwing it away wasn't feasible due to time constraints and the fact that I didn't have the specification required so rewriting it would've meant understanding the original, but after understanding it fixing it was just quicker. I also added lots of comments, so whoever shall come after me won't feel the same pain taking a look at it...
The Perl statement:
select((select(s),$|=1)[0])
which, at the suggestion of the original author (Randal Schwartz himself, who said he disliked it but nothing else was available at the time), was replaced with something a little more understandable:
IO::Handle->autoflush
Beyond that one-liner, some of the Java JDBC libraries from IBM are obfuscated and all variables and functions are either combinations of the letter 'l' and '1' or single/double characters - very hard to track anything down until you get them all renamed. Needed to do this to track down why they worked fine in IBM's JRE but not Sun's.
If you're talking about HLL codes, once I was updating project written by a chinese and all comments were chinese (stored in ansii) and it was a horror to understand some code fragments, if you're talking about low level code there were MANY of them (obfuscated, mutated, vm-ed...).
I once had to reverse engineer a Java 1.1 framework that:
Extended event-driven SAX parser classes for every class, even those that didn't parse XML (the overridden methods were simply invoked ad hoc by other code)
Custom runtime exceptions were thrown in lieu of method invocations wherever possible. As a result, most of the business logic landed in a nested series of catch blocks.
If I had to guess, it was probably someone's "smart" idea that method invocations were expensive in Java 1.1, so throwing exceptions for non-exceptional flow control was somehow considered an optimization.
Went through about three bottles of eye drops.
I once found a time bomb that had been intentionally obfuscated.
When I had finally decoded what it was doing I mentioned it to the manager who said they knew about the time bomb but had left it in place because it was so ineffective and was interwoven with other code.
The time bomb was (presumably) supposed to go off after a certain date.
Instead, it had a bug in it so it only activated if someone was working after lunchtime on Dec 31st.
It had taken three years for that circumstance to occur since the guy who wrote the time bomb left the company.

What is your personal approach/take on commenting?

Duplicate
What are your hard rules about commenting?
A Developer I work with had some things to say about commenting that were interesting to me (see below). What is your personal approach/take on commenting?
"I don't add comments to code unless
its a simple heading or there's a
platform-bug or a necessary
work-around that isn't obvious. Code
can change and comments may become
misleading. Code should be
self-documenting in its use of
descriptive names and its logical
organization - and its solutions
should be the cleanest/simplest way
to perform a given task. If a
programmer can't tell what a program
does by only reading the code, then
he's not ready to alter it.
Commenting tends to be a crutch for
writing something complex or
non-obvious - my goal is to always
write clean and simple code."
"I think there a few camps when it
comes to commenting, the
enterprisey-type who think they're
writing an API and some grand
code-library that will be used for
generations to come, the
craftsman-like programmer that thinks
code says what it does clearer than a
comment could, and novices that write
verbose/unclear code so as to need to
leave notes to themselves as to why
they did something."
There's a tragic flaw with the "self-documenting code" theory. Yes, reading the code will tell you exactly what it is doing. However, the code is incapable of telling you what it's supposed to be doing.
I think it's safe to say that all bugs are caused when code is not doing what it's supposed to be doing :). So if we add some key comments to provide maintainers with enough information to know what a piece of code is supposed to be doing, then we have given them the ability to fix a whole lot of bugs.
That leaves us with the question of how many comments to put in. If you put in too many comments, things become tedious to maintain and the comments will inevitably be out of date with the code. If you put in too few, then they're not particularly useful.
I've found regular comments to be most useful in the following places:
1) A brief description at the top of a .h or .cpp file for a class explaining the purpose of the class. This helps give maintainers a quick overview without having to sift through all of the code.
2) A comment block before the implementation of a non-trivial function explaining the purpose of it and detailing its expected inputs, potential outputs, and any oddities to expect when calling the function. This saves future maintainers from having to decipher entire functions to figure these things out.
Other than that, I tend to comment anything that might appear confusing or odd to someone. For example: "This array is 1 based instead of 0 based because of blah blah".
Well written, well placed comments are invaluable. Bad comments are often worse than no comments. To me, lack of any comments at all indicates laziness and/or arrogance on the part of the author of the code. No matter how obvious it is to you what the code is doing or how fantastic your code is, it's a challenging task to come into a body of code cold and figure out what the heck is going on. Well done comments can make a world of difference getting someone up to speed on existing code.
I've always liked Refactoring's take on commenting:
The reason we mention comments here is that comments often are used as a deodorant. It's surprising how often you look at thickly commented code and notice that the comments are there because the code is bad.
Comments lead us to bad code that has all the rotten whiffs we've discussed in the rest of this chapter. Our first action is to remove the bad smells by refactoring. When we're finished, we often find that the comments are superfluous.
As controversial as that is, it's rings true for the code I've read. To be fair, Fowler isn't saying to never comment, but to think about the state of your code before you do.
You need documentation (in some form; not always comments) for a local understanding of the code. Code by itself tells you what it does, if you read all of it and can keep it all in mind. (More on this below.) Comments are best for informal or semiformal documentation.
Many people say comments are a code smell, replaceable by refactoring, better naming, and tests. While this is true of bad comments (which are legion), it's easy to jump to concluding it's always so, and hallelujah, no more comments. This puts all the burden of local documentation -- too much of it, I think -- on naming and tests.
Document the contract of each function and, for each type of object, what it represents and any constraints on a valid representation (technically, the abstraction function and representation invariant). Use executable, testable documentation where practical (doctests, unit tests, assertions), but also write short comments giving the gist where helpful. (Where tests take the form of examples, they're incomplete; where they're complete, precise contracts, they can be as much work to grok as the code itself.) Write top-level comments for each module and each project; these can explain conventions that keep all your other comments (and code) short. (This supports naming-as-documentation: with conventions established, and a place we can expect to find subtleties noted, we can be confident more often that the names tell all we need to know.) Longer, stylized, irritatingly redundant Javadocs have their uses, but helped generate the backlash.
(For instance, this:
Perform an n-fold frobulation.
#param n the number of times to frobulate
#param x the x-coordinate of the center of frobulation
#param y the y-coordinate of the center of frobulation
#param z the z-coordinate of the center of frobulation
could be like "Frobulate n times around the center (x,y,z)." Comments don't have to be a chore to read and write.)
I don't always do as I say here; it depends on how much I value the code and who I expect to read it. But learning how to write this way made me a better programmer even when cutting corners.
Back on the claim that we document for the sake of local understanding: what does this function do?
def is_even(n): return is_odd(n-1)
Tests if an integer is even? If is_odd() tests if an integer is odd, then yes, that works. Suppose we had this:
def is_odd(n): return is_even(n-1)
The same reasoning says this is_odd() tests if an integer is odd. Put them together, of course, and neither works, even though each works if the other does. Change it a bit and we'd have code that does work, but only for natural numbers, while still locally looking like it works for integers. In microcosm that's what understanding a codebase is like: tracing dependencies around in circles to try to reverse-engineer assumptions the author could have explained in a line or two if they'd bothered. I hate the expense of spirit thoughtless coders have put me to this way over the past couple of decades: oh, this method looks like it has the side effect of farbuttling the warpcore... always? Well, if odd crobuncles desaturate, at least; do they? Better check all the crobuncle-handling code... which will pose its own challenges to understanding. Good documentation cuts this O(n) pointer-chasing down to O(1): e.g. knowing a function's contract and the contracts of the things it explicitly uses, the function's code should make sense with no further knowledge of the system. (Here, contracts saying is_even() and is_odd() work on natural numbers would tell us that both functions need to test for n==0.)
My only real rule is that comments should explain why code is there, not what it is doing or how it is doing it. Those things can change, and if they do the comments have to be maintained. The purpose the code exists in the first place shouldn't change.
the purpose of comments is to explain the context - the reason for the code; this, the programmer cannot know from mere code inspection. For example:
strangeSingleton.MoveLeft(1.06);
badlyNamedGroup.Ignite();
who knows what the heck this is for? but with a simple comment, all is revealed:
//when under attack, sidestep and retaliate with rocket bundles
strangeSingleton.MoveLeft(1.06);
badlyNamedGroup.Ignite();
seriously, comments are for the why, not the how, unless the how is unintuitive.
While I agree that code should be self-readable, I still see a lot of value in adding extensive comment blocks for explaining design decisions. For example "I did xyz instead of the common practice of abc because of this caveot ..." with a URL to a bug report or something.
I try to look at it as: If I'm dead and gone and someone straight out of college has to fix a bug here, what are they going to need to know?
In general I see comments used to explain poorly written code. Most code can be written in a way that would make comments redundant. Having said that I find myself leaving comments in code where the semantics aren't intuitive, such as calling into an API that has strange or unexpected behavior etc...
I also generally subscribe to the self-documenting code idea, so I think your developer friend gives good advice, and I won't repeat that, but there are definitely many situations where comments are necessary.
A lot of times I think it boils down to how close the implementation is to the types of ordinary or easy abstractions that code-readers in the future are going to be comfortable with or more generally to what degree the code tells the entire story. This will result in more or fewer comments depending on the type of programming language and project.
So, for example if you were using some kind of C-style pointer arithmetic in an unsafe C# code block, you shouldn't expect C# programmers to easily switch from C# code reading (which is probably typically more declarative or at least less about lower-level pointer manipulation) to be able to understand what your unsafe code is doing.
Another example is when you need to do some work deriving or researching an algorithm or equation or something that is not going to end up in your code but will be necessary to understand if anyone needs to modify your code significantly. You should document this somewhere and having at least a reference directly in the relevant code section will help a lot.
I don't think it matters how many or how few comments your code contains. If your code contains comments, they have to maintained, just like the rest of your code.
EDIT: That sounded a bit pompous, but I think that too many people forget that even the names of the variables, or the structures we use in the code, are all simply "tags" - they only have meaning to us, because our brains see a string of characters such as customerNumber and understand that it is a customer number. And while it's true that comments lack any "enforcement" by the compiler, they aren't so far removed. They are meant to convey meaning to another person, a human programmer that is reading the text of the program.
If the code is not clear without comments, first make the code a clearer statement of intent, then only add comments as needed.
Comments have their place, but primarily for cases where the code is unavoidably subtle or complex (inherent complexity is due to the nature of the problem being solved, not due to laziness or muddled thinking on the part of the programmer).
Requiring comments and "measuring productivity" in lines-of-code can lead to junk such as:
/*****
*
* Increase the value of variable i,
* but only up to the value of variable j.
*
*****/
if (i < j) {
++i;
} else {
i = j;
}
rather than the succinct (and clear to the appropriately-skilled programmer):
i = Math.min(j, i + 1);
YMMV
The vast majority of my commnets are at the class-level and method-level, and I like to describe the higher-level view instead of just args/return value. I'm especially careful to describe any "non-linearities" in the function (limits, corner cases, etc) that could trip up the unwary.
Typically I don't comment inside a method, except to mark "FIXME" items, or very occasionally some sort of "here be monsters" gotcha that I just can't seem to clean up, but I work very hard to avoid those. As Fowler says in Refactoring, comments tend to indicate smally code.
Comments are part of code, just like functions, variables and everything else - and if changing the related functionality the comment must also be updated (just like function calls need changing if function arguments change).
In general, when programming you should do things once in one place only.
Therefore, if what code does is explained by clear naming, no comment is needed - and this is of course always the goal - it's the cleanest and simplest way.
However, if further explanation is needed, I will add a comment, prefixed with INFO, NOTE, and similar...
An INFO: comment is for general information if someone is unfamiliar with this area.
A NOTE: comment is to alert of a potential oddity, such as a strange business rule / implementation.
If I specifically don't want people touching code, I might add a WARNING: or similar prefix.
What I don't use, and am specifically opposed to, are changelog-style comments - whether inline or at the head of the file - these comments belong in the version control software, not the sourcecode!
I prefer to use "Hansel and Gretel" type comments; little notes in the code as to why I'm doing it this way, or why some other way isn't appropriate. The next person to visit this code will probably need this info, and more often than not, that person will be me.
As a contractor I know that some people maintaining my code will be unfamiliar with the advanced features of ADO.Net I am using. Where appropriate, I add a brief comment about the intent of my code and a URL to an MSDN page that explains in more detail.
I remember learning C# and reading other people's code I was often frustrated by questions like, "which of the 9 meanings of the colon character does this one mean?" If you don't know the name of the feature, how do you look it up?! (Side note: This would be a good IDE feature: I select an operator or other token in the code, right click then shows me it's language part and feature name. C# needs this, VB less so.)
As for the "I don't comment my code because it is so clear and clean" crowd, I find sometimes they overestimate how clear their very clever code is. The thought that a complex algorithm is self-explanatory to someone other than the author is wishful thinking.
And I like #17 of 26's comment (empahsis added):
... reading the code will tell you exactly
what it is doing. However, the code is
incapable of telling you what it's
supposed to be doing.
I very very rarely comment. MY theory is if you have to comment it's because you're not doing things the best way possible. Like a "work around" is the only thing I would comment. Because they often don't make sense but there is a reason you are doing it so you need to explain.
Comments are a symptom of sub-par code IMO. I'm a firm believer in self documenting code. Most of my work can be easily translated, even by a layman, because of descriptive variable names, simple form, and accurate and many methods (IOW not having methods that do 5 different things).
Comments are part of a programmers toolbox and can be used and abused alike. It's not up to you, that other programmer, or anyone really to tell you that one tool is bad overall. There are places and times for everything, including comments.
I agree with most of what's been said here though, that code should be written so clear that it is self-descriptive and thus comments aren't needed, but sometimes that conflicts with the best/optimal implementation, although that could probably be solved with an appropriately named method.
I agree with the self-documenting code theory, if I can't tell what a peice of code is doing simply by reading it then it probably needs refactoring, however there are some exceptions to this, I'll add a comment if:
I'm doing something that you don't
normally see
There are major side effects or implementation details that aren't obvious, or won't be next year
I need to remember to implement
something although I prefer an
exception in these cases.
If I'm forced to go do something else and I'm having good ideas, or a difficult time with the code, then I'll add sufficient comments to tmporarily preserve my mental state
Most of the time I find that the best comment is the function or method name I am currently coding in. All other comments (except for the reasons your friend mentioned - I agree with them) feel superfluous.
So in this instance commenting feels like overkill:
/*
* this function adds two integers
*/
int add(int x, int y)
{
// add x to y and return it
return x + y;
}
because the code is self-describing. There is no need to comment this kind of thing as the name of the function clearly indicates what it does and the return statement is pretty clear as well. You would be surprised how clear your code becomes when you break it down into tiny functions like this.
When programming in C, I'll use multi-line comments in header files to describe the API, eg parameters and return value of functions, configuration macros etc...
In source files, I'll stick to single-line comments which explain the purpose of non-self-evident pieces of code or to sub-section a function which can't be refactored to smaller ones in a sane way. Here's an example of my style of commenting in source files.
If you ever need more than a few lines of comments to explain what a given piece of code does, you should seriously consider if what you're doing can't be done in a better way...
I write comments that describe the purpose of a function or method and the results it returns in adequate detail. I don't write many inline code comments because I believe my function and variable naming to be adequate to understand what is going on.
I develop on a lot of legacy PHP systems that are absolutely terribly written. I wish the original developer would have left some type of comments in the code to describe what was going on in those systems. If you're going to write indecipherable or bad code that someone else will read eventually, you should comment it.
Also, if I am doing something a particular way that doesn't look right at first glance, but I know it is because the code in question is a workaround for a platform or something like that, then I'll comment with a WARNING comment.
Sometimes code does exactly what it needs to do, but is kind of complicated and wouldn't be immediately obvious the first time someone else looked at it. In this case, I'll add a short inline comment describing what the code is intended to do.
I also try to give methods and classes documentation headers, which is good for intellisense and auto-generated documentation. I actually have a bad habit of leaving 90% of my methods and classes undocumented. You don't have time to document things when you're in the middle of coding and everything is changing constantly. Then when you're done you don't feel like going back and finding all the new stuff and documenting it. It's probably good to go back every month or so and just write a bunch of documentation.
Here's my view (based on several years of doctoral research):
There's a huge difference between commenting functions (sort of a black box use, like JavaDocs), and commenting actual code for someone who will read the code ("internal commenting").
Most "well written" code shouldn't require much "internal commenting" because if it performs a lot then it should be broken into enough function calls. The functionality for each of these calls is then captured in the function name and in the function comments.
Now, function comments are indeed the problem, and in some ways your friend is right, that for most code there is no economical incentive for complete specifications the way that popular APIs are documented. The important thing here is to identify what are the "directives": directives are those information pieces that directly affect clients, and require some direct action (and are often unexpected). For example, X must be invoked before Y, don't call this from outside a UI thread, be aware that this has a certain side effect, etc. These are the things that are really important to capture.
Since most people never read full function documentations, and skim what they do read, you can actually increase the chances of awareness by capturing only the directives rather than the whole description.
I comment as much as needed - then, as much as I will need it a year later.
We add comments which provide the API reference documentation for all public classes / methods / properties / etc... This is well worth the effort because XML Documentation in C# has the nice effect of providing IntelliSense to users of these public APIs. .NET 4.0's code contracts will enable us to improve further on this practice.
As a general rule, we do not document internal implementations as we write code unless we are doing something non-obvious. The theory is that while we are writing new implementations, things are changing and comments are more likely than not to be wrong when the dust settles.
When we go back in to work on an existing piece of code, we add comments when we realize that it's taking some thought to figure out what in the heck is going on. This way, we wind up with comments where they are more likely to be correct (because the code is more stable) and where they are more likely to be useful (if I'm coming back to a piece of code today, it seems more likely that I might come back to it again tomorrow).
My approach:
Comments bridge the gap between context / real world and code. Therefore, each and every single line is commented, in correct English language.
I DO reject code that doesn't observe this rule in the strictest possible sense.
Usage of well formatted XML - comments is self-evident.
Sloppy commenting means sloppy code!
Here's how I wrote code:
if (hotel.isFull()) {
print("We're fully booked");
} else {
Guest guest = promptGuest();
hotel.checkIn(guest);
}
here's a few comments that I might write for that code:
// if hotel is full, refuse checkin, otherwise
// prompt the user for the guest info, and check in the guest.
If your code reads like a prose, there is no sense in writing comments that simply repeats what the code reads since the mental processing needed for reading the code and the comments would be almost equal; and if you read the comments first, you will still need to read the code as well.
On the other hand, there are situations where it is impossible or extremely difficult to make the code looks like a prose; that's where comment could patch in.

Resources