Dealing with code movement when comparing static analysis reports - static-analysis

When I run a static analysis tool over my codebase, and I get results like this:
...
arch/powerpc/kernel/time.c:102:5: warning: symbol 'decrementer_max' was not declared. Should it be static?
arch/powerpc/kernel/time.c:138:1: warning: symbol 'rtc_lock' was not declared. Should it be static?
arch/powerpc/kernel/time.c:361:37: warning: implicit cast to nocast type
...
I want to keep track of the number of warnings and where they are in the code as people make changes.
I could just diff the results of the static analysis runs, but then if someone inserts some code in time.c at line 50, the warnings above will move, and because the line numbers have changed, diff will tell me that they've changed.
How should I go about comparing these in a way that deals with movement of code within a file?
Googling for 'smart diff', etc hasn't been productive: they're mostly smart diffs of code rather than smart diff of logs. Log analysis tools like Greylog or Kibana also seem like a poor fit, designed more for different and more general analysis rather than for this quite specific task.
Is there something obvious that I'm missing? Or is this a problem where I should expect to be writing my own tooling?

You could maintain a merge of the code and the errors: insert each error message (minus its line number) after the corresponding line of code. Then if someone inserts code at line 50, the (updated) merge will not have diffs around the later error points. It'll have a diff at line 50, of course, which you may or may not be interested in. If you like, you can ignore diff-chunks that don't involve an error message (for which you'd need some distinctive marker at each inserted error message).

I had a go with a slightly simpler setup - as #ajd suggested, parsing the messages, and doing line-number-insensitive matching.
The code is up at https://github.com/daxtens/smart-sparse-diff

Related

Adding keyword commands and functions to Textmate 2 bundle syntax

I want to add some additional syntax highlighting definitions to an existing bundle, but I need some general advice on how to do this. I'm not building a syntax from scratch, and I think my request is pretty simple, but I think it involves some subtleties for which I find the manual somewhat impenetrable in finding the answer.
Basically, I'm trying to fill out the syntax definitions for the Stata bundle. It's great, but there is no built in support for automatically highlighting the base commands and the installed functions, only a handful of basic control statements. Stata is a language which is primarily used by calling lots of different high level pre-defined command calls, like command foo bar, options(). The convention is that these command calls be highlighted.
There are a ton of these commands, and stubs which are used for convenience. Just the base install has almost 3500. Even optimizing them using the bundle helper, which obviously gets rid of the stub issue, still yields a massive regex list. I can easily cut this down to less than 1000 important ones, but its still a lot. There are also 350 "functions" which I would like to match with the syntax function()
I essentially have 3 questions:
Am I creating a serious problem by including a very comprehensive list of matching definitions?
How do I restrict a command to only highlight when it either begins a line or there is only whitespace between the beginning line and the command
What is the preferred way of restricting the list of functions() to only highlight when they have attached parentheses?

How to create AST parser which allows syntax errors?

First, what to read about parsing and building AST?
How to create parser for a language (like SQL) that will build an AST and allow syntax errors?
For example, for "3+4*5":
+
/ \
3 *
/ \
4 5
And for "3+4*+" with syntax error, parser would guess that the user meant:
+
/ \
3 *
/ \
4 +
/ \
? ?
Where to start?
SQL:
SELECT_________________
/ \ \
. FROM JOIN
/ \ | / \
a city_name people address ON
|
=______________
/ \
.____ .
/ \ / \
p address_id a id
The standard answer to the question of how to build parsers (that build ASTs), is to read the standard texts on compiling. Aho and Ullman's "Dragon" Compiler book is pretty classic. If you haven't got the patience to get the best reference materials, you're going to have more trouble, because they provide theory and investigate subtleties. But here is my answer for people in a hurry, building recursive descent parsers.
One can build parsers with built-in error recovery. There are many papers on this sort of thing, a hot topic in the 1980s. Check out Google Scholar, hunt for "syntax error repair". The basic idea is that the parser, on encountering a parsing error, skips to some well-known beacon (";" a statement delimiter is pretty popular for C-like languages, which is why you got asked in a comment if your language has statement terminators), or proposes various input stream deletions or insertions to climb over the point of the syntax error. The sheer variety of such schemes is surprising. The key idea is generally to take into account as much information around the point of error as possible. One of the most intriguing ideas I ever saw had two parsers, one running N tokens ahead of the other, looking for syntax-error land-mines, and the second parser being feed error repairs based on the N tokens available before it encounters the syntax error. This lets the second parser choose to act differently before arriving at the syntax error. If you don't have this, most parser throw away left context and thus lose the ability to repair. (I never implemented such a scheme.)
The choice of things to insert can often be derived from information used to build the parser (often First and Follow sets) in the first place. This is relatively easy to do with L(AL)R parsers, because the parse tables contain the necessary information and are available to the parser at the point where it encounters an error. If you want to understand how to do this, you need to understand the theory (oops, there's that compiler book again) of how the parsers are constructed. (I have implemented this scheme successfully several times).
Regardless of what you do, syntax error repair doesn't help much, because it is almost impossible to guess what the writer of the parsed document actually intended. This suggests fancy schemes won't be really helpful. I stick to simple ones; people are happy to get an error report and some semi-graceful continuation of parsing.
A real problem with rolling your own parser for a real language, is that real languages are nasty messy things; people building real implementations get it wrong and frozen in stone because of existing code bases, or insist on bending/improving the language (standards are for wimps, goodies are for marketing) because its cool. Expect to spend a lot of time re-calibrating what you think the grammar is, against the ground truth of real code. As a general rule, if you want a working parser, better to get one that has a track record rather than roll it yourself.
A lesson most people that are hell-bent to build a parser don't get, is that if they want to do anything useful with the parse result or tree, they'll need a lot more basic machinery than just the parser. Check my bio for "Life After Parsing".
There are two things the parser could do:
Report the error and have the user try again.
Repair the error and proceed.
Generally speaking the first one is easier (and safer). There may not always be enough information for the parser to infer the intent when the syntax is wrong. Depending on the circumstances, it may be dangerous to proceed with a repair that makes the input syntactically correct but semantically wrong.
I've written a few hand-rolled recursive descent parsers for little languages. When writing code to interpret the grammar rules explicitly (as opposed to using a parser-generator), it's easy to detect errors, because the next token doesn't fit the production rule. Generated parsers tend to spit out a simplistic "expected $(TOKEN_TYPE) here" message, which isn't always useful to the user. With a hand-written parser, it's often easy to give a more specific diagnostic message, but it can be time consuming to cover every case.
If your goal is the report the problem but to keep parsing (so that you can see if there are additional problems), you can put a special AST node in the tree at the point of the error. This keeps the tree from falling apart.
You then have to resync to some point beyond the error in order to continue parsing. As Ira Baxter mentioned in his answer, you might look for a token, like ';', that separates statements. The correct token(s) to look for depends on the language you're parsing. Another possibility is to guess what the user meant (e.g., infer an extra token or a different token at the point the error was detected) and then continue. If you encounter another syntax error within the next few tokens, you could backtrack, make a different guess, and try again.

Find write statement in Fortran

I'm using Fortran for my research and sometimes, for debugging purposes, someone will insert in the code something like this:
write(*,*) 'Variable x:', varx
The problem is that sometimes it happens that we forget to remove that statement from the code and it becomes difficult to find where it is being printed. I usually can get a good idea where it is by the name 'Variable x' but it sometimes happens that that information might no be present and I just see random numbers showing up.
One can imagine that doing a grep for write(*,*) is basically useless so I was wondering if there is an efficient way of finding my culprit, like forcing every call of write(*,*) to print a file and line number, or tracking stdout.
Thank you.
Intel's Fortran preprocessor defines a number of macros, such as __file__ and __line__ which will be replaced by, respectively, the file name (as a string) and line number (as an integer) when the pre-processor runs. For more details consult the documentation.
GFortran offers similar facilities, consult the documentation.
Perhaps your compiler offers similar capabilities.
As has been previously implied, there's no Fortran--although there may be a compiler approach---way to change the behaviour of the write statement as you want. However, as your problem is more to do with handling (unintentionally produced) bad code there are options.
If you can't easily find an unwanted write(*,*) in your code that suggests that you have many legitimate such statements. One solution is to reduce the count:
use an explicit format, rather than list-directed output (* as the format);
instead of * as the output unit, use output_unit from the intrinsic module iso_fortran_env.
[Having an explicit format for "proper" output is a good idea, anyway.]
If that fails, use your version control system to compare an old "good" version against the new "bad" version. Perhaps even have your version control system flag/block commits with new write(*,*)s.
And if all that still doesn't help, then the pre-processor macros previously mentioned could be a final resort.

Code duplication refactoring tool for VB

I need a very specific tool for VB (or multi-language). I thought I would ask if one already exists, before I start making one myself (probably, in python).
What I need:
The tool must crawl, recursivelly or not, a path, searching for a list of extension, such as .bas, .frm, .xxx
Then, It has to parse that files, searching for functions, routines, etc.
And finally, it must output what it found.
I based this on the idea of, "reducing code redundance", in an scenario where, bad programmers make a lot of functions that do the same thing, sometimes with the same name, sometimes not. There are 4 cases:
Case 1: Same name, Same content.
Case 2: Same name, Diff content.
Case 3: Diff name, Same content.
Case 4: Diff name, Diff Content.
So, the output, should be something like this
===========================================================================
RESULT
===========================================================================
Errors:
---------------------------------------------------------------------------
==Name, ==Content --> 3: (Func(), Foo(), Bar()) In files (f,f2,f3)
!=Name, ==Content --> 2: (Func() + Func1(), Bar() + Bar1()) In Files (f4)
---------------------------------------------------------------------------
Warnings:
==Name, !=Content --> 1 (Foobar()) In Files (f19)
---------------------------------------------------------------------------
This is to give you an idea of what I need.
So, the question is: is there any tool that acomplish something similar to this???
P.S: Yes, we should write good code, in first instance, but, you know, stuff happens.
What you want is a "clone detector". These tools find copy-and-pasted code across a large set of designated files. Clones are not just of functions; they can be code blocks, data declarations, etc.
There are a variety of detectors out there, and you should know how they work before you attempt to build one of your own.
Some simply match lines for exact equivalence. While these demonstrate the basic idea, their detection is not good because they don't take into account the fact that cloned code often has variations; what people really do is clone-and-edit when making copies.
Some match sequences of langauge tokens, e.g., identifiers, keywords, literals, punctuation. These at least are relatively tolerant of whitespace changes. And they can find clones in which single tokens have been substituted for single tokens. However, because they don't understand language structure (blocks, statements, function bodies) they often match sequences that cross such structure boundaries (e.g., "} {" is often considered a clone by these tools), they produce rather high false-positive indications of (non)clones. Some of these attempt to limit the matches to key program structures, such as complete functions, as you have kind of suggested.
More sophisticated detectors match program structures.
Our CloneDR (I'm the original author) is a detector that
uses compiler-quality parsing to abstract syntax trees, which extracts the precise structure of the code. It does this for many languages (including VB6 and VBScript), locating clones as arbitrary functions, blocks, statements or declarations, with parameters shows how the clones vary. CloneDR can find clones in spite of formatting changes, changes in comment locations or content, and even variations where complex constructs (multiple statements or expressions) have been used as alternatives to simple ones (e.g., a single statment or a literal). While it tends to have a high detection rate(it usually finds 10-20% removable redundancy!), its false-positive rate tends to be considerably lower than the token based detectors. You can see sample reports for
a variety of different langauges at the link above.
See Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach which explicitly discusses different approaches and benefits, and compares a large number of detectors including CloneDR.
EDIT October 2010: ... When I first wrote this response, I assumed the OP was interested in VB.net, which CloneDR didn't do. We've since added VB.net, VB6 and VBScript capability to CloneDR. (Parsing VB.net in its modern form is a lot messier than one might imagine for "simple"(!) langauge like Visual Basic).

Cross version line matching

I'm considering how to do automatic bug tracking and as part of that I'm wondering what is available to match source code line numbers (or more accurate numbers mapped from instruction pointers via something like addr2line) in one version of a program to the same line in another. (Assume everything is in some kind of source control and is available to my code)
The simplest approach would be to use a diff tool/lib on the files and do some math on the line number spans, however this has some limitations:
It doesn't handle cross file motion.
It might not play well with lines that get changed
It doesn't look at the information available in the intermediate versions.
It provides no way to manually patch up lines when the diff tool gets things wrong.
It's kinda clunky
Before I start diving into developing something better:
What already exists to do this?
What features do similar system have that I've not thought of?
Why do you need to do this? If you use decent source version control, you should have access to old versions of the code, you can simply provide a link to that so people can see the bug in its original place. In fact the main problem I see with this system is that the bug may have already been fixed, but your automatic line tracking code will point to a line and say there's a bug there. Seems this system would be a pain to build, and not provide a whole lot of help in practice.
My suggestion is: instead of trying to track line numbers, which as you observed can quickly get out of sync as software changes, you should decorate each assertion (or other line of interest) with a unique identifier.
Assuming you're using C, in the case of assertions, this could be as simple as changing something like assert(x == 42); to assert(("check_x", x == 42)); -- this is functionally identical, due to the semantics of the comma operator in C and the fact that a string literal will always evaluate to true.
Of course this means that you need to identify a priori those items that you wish to track. But given that there's no generally reliable way to match up source line numbers across versions (by which I mean that for any mechanism you could propose, I believe I could propose a situation in which that mechanism does the wrong thing) I would argue that this is the best you can do.
Another idea: If you're using C++, you can make use of RAII to track dynamic scopes very elegantly. Basically, you have a Track class whose constructor takes a string describing the scope and adds this to a global stack of currently active scopes. The Track destructor pops the top element off the stack. The final ingredient is a static function Track::getState(), which simply returns a list of all currently active scopes -- this can be called from an exception handler or other error-handling mechanism.

Resources