Throw error for unclosed comment block javacc - comments

I am creating a lexer in javacc that skips block comments that start with /* and end with */. I have it working correctly for valid block comments but I am trying to figure out a way to throw an error when a block comment is unclosed...
Example:
/* this is not a valid block comment
/* this is a valid block comment*/
Here is what I have to skip valid block comments:
MORE: { <"/*"> : BLC_CMNT_ST}
<BLC_CMNT_ST> SKIP: { <"*/"> : DEFAULT }
<BLC_CMNT_ST> MORE: { <~[]>}
Currently, when I run the lexer, a TokenMgrError is thrown when there is an unclosed block comment. I would like to catch this error and/or throw my own error that displays the matchedToken.image. I have tried a few different approaches but have run into issues, so any help would be greatly appreciated.

How about
SKIP: { <"/*"> : BLC_CMNT_ST}
<BLC_CMNT_ST> SKIP: { "*/" : DEFAULT }
<BLC_CMNT_ST> SKIP: { < ~[] > }
<*> TOKEN : { <EOF>
{ System.out.println("Lexical state is " + curLexState ) ;
if(curLexState==BLC_CMNT_ST) throw new Error("Unmatched comment at end of file.") ; } }
I had to use SKIP instead of MORE for reasons I don't fully understand.
If you want to disallow "/*" inside of block comments you can add this production
<BLC_CMNT_ST> TOKEN: { < "/*" >
{ if(true) throw new Error("Unmatched comment at line "
+ matchedToken.beginLine
+ ", column "
+ matchedToken.beginColumn + ".") ; } }
Unfortunately this solution does not give you access to the image of the comment.
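If simply surfacing the default error is acceptable, you can also catch the TokenMgrError (it extends java.lang.Error) around the parse call and report it with your own message. A minimal sketch, assuming a generated parser class named MyParser with a start production Start(); both names are placeholders:
import java.io.FileReader;

public class Main {
    public static void main(String[] args) throws Exception {
        try {
            MyParser parser = new MyParser(new FileReader(args[0]));
            parser.Start();
        } catch (TokenMgrError e) {
            // Thrown by the generated token manager on lexical failures,
            // e.g. an unclosed block comment running into end-of-file.
            System.err.println("Lexical error: " + e.getMessage());
        }
    }
}
Like the EOF rule above, this does not recover the comment's image, but it lets you replace the raw stack trace with a friendlier message.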

Related

Groovy compiler fails for nested omitted parentheses in method calls

In Groovy, parentheses can be omitted when there is no ambiguity. However, the groovy compiler fails for this piece of code:
def firstChar(String str) { str[0] }
println " ".split(firstChar " ")
I have trouble understanding what is ambiguous here. The error is as follows:
Groovyc: Unexpected input: '"".split(firstChar " "'
In my actual use-case, the error reports a completely unrelated element. For this code:
existingInputFile.withReader { reader ->
def outputFile = new File(/name.txt/)
outputFile.createNewFile()
outputFile.withWriter { writer ->
writer.write reader.lines()
.map { line -> line.split " " }
.map { line -> "${line.head()} ${line[1]}}" }
.collect(Collectors.joining "\n")
}
}
It complains about:
Groovyc: Unexpected input: '{'
pointing to the very first line of the above snippet.
As you can see, there are "nested" omitted parentheses in the method calls writer.write and Collectors.joining.
Is this a compiler bug or can something like that really be ambiguous?
I would use "standard" Groovy to fulfill your use-case:
// the slashy string should be used for regex only
new File('name.txt').withWriter { writer ->
existingInputFile.splitEachLine( / / ){ __, first, second ->
writer.write "$first $second\n"
}
}
without meddling with the Stream API and groovyc tricks.
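For what it's worth, the error also disappears if you parenthesize the inner call; omitting parentheses appears to be supported only for the outermost call of an expression, not for calls nested inside another argument list. An untested sketch of the first example with explicit parentheses:
def firstChar(String str) { str[0] }

// Parenthesizing the nested call removes the ambiguity:
println " ".split(firstChar(" "))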

C++ empty() and all_of() for checking string is empty or have only digit

I'm creating a console IO application, and at the inputs I have a 'while' loop with two conditions,
empty() and all_of(). The all_of() function seems to work properly, but when I press Enter the empty() check does not work and just lets me input the next thing in the 'struct'. I'm not sure I'm doing it correctly. Here is the relevant part of the code:
cout << "Enter age: ";
getline(cin, age_str);
while(!age_str.empty() && !all_of(age_str.begin(), age_str.end(), ::isdigit)){
cout << "--Please Enter an integer-- " << endl;
cin.clear();
getline(cin, age_str);
}
stringstream(age_str) >> person_arr[n].age;
The logic of the conditional of the while is incorrect.
What you need to do is:
If the line is empty, get the next line.
If the line is not empty and the line has anything other than digits, get the next line.
!age_str.empty() && !all_of(age_str.begin(), age_str.end(), ::isdigit) does not do that.
You need to use age_str.empty() || (!all_of(age_str.begin(), age_str.end(), ::isdigit))
I always recommend, when in doubt, simplify.
while (!is_input_valid(age_str))
{
...
}
where
bool is_input_valid(std::string const& input)
{
if ( input.empty() )
{
return false;
}
return std::all_of(input.begin(), input.end(), ::isdigit);
}
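Putting it together, here is a minimal self-contained sketch of the corrected loop (variable names follow the question; the unsigned char cast is an added safety tweak, since passing a negative char to ::isdigit is undefined behaviour):
#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

bool is_input_valid(std::string const& input)
{
    if (input.empty())
    {
        return false;
    }
    // Cast to unsigned char so isdigit never sees a negative value.
    return std::all_of(input.begin(), input.end(),
                       [](unsigned char c) { return std::isdigit(c); });
}

int main()
{
    std::string age_str;
    int age = 0;

    std::cout << "Enter age: ";
    std::getline(std::cin, age_str);
    while (!is_input_valid(age_str))
    {
        std::cout << "--Please Enter an integer-- " << std::endl;
        std::getline(std::cin, age_str);
    }
    std::stringstream(age_str) >> age;
    std::cout << "Age: " << age << '\n';
    return 0;
}
Note that the cin.clear() from the original loop is unnecessary here: an empty line does not put the stream into a failed state, so there is nothing to clear.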

split text file according to brackets or parantheses (top-level only) in terminal

I have several text files (utf-8) that I want to process in shell script. They aren't exactly the same format, but if I could only break them up into edible chunks I can handle that.
This could be programmed in C or python, but I prefer not.
EDIT: I wrote a solution in C; see my own answer. I think this may be the simplest approach after all. If you think I'm wrong please test your solution against the more complicated example input from my answer below.
-- jcxz100
For clarity (and to be able to debug more easily) I want the chunks to be saved as separate text files in a sub-folder.
All types of input files consist of:
junk lines
lines with junk text followed by start brackets or parentheses - i.e. '[' '{' '<' or '(' - and possibly followed by payload
payload lines
lines with brackets or parentheses nested within the top-level pairs; treated as payload too
payload lines with end brackets or parentheses - i.e. ']' '}' '>' or ')' - possibly followed by something (junk text and/or start of a new payload)
I want to break up the input according to only the matching pairs of top-level brackets/parentheses.
Payload inside these pairs must not be altered (including newlines and whitespace).
Everything outside the top-level pairs should be discarded as junk.
Any junk or payload inside double-quotes must be considered atomic (handled as raw text, thus any brackets or parentheses inside should also be treated as text).
Here is an example (using only {} pairs):
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
} trailing junk
intermittent junk
{
payload that goes in second output file }
end junk
...sorry: Some of the input files really are as messy as that.
The first output file should be:
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
... and the second output file:
{
payload that goes in second output file }
Note:
I haven't quite decided whether it's necessary to keep the pair of start/end characters in the output or if they themselves should be discarded as junk.
I think a solution that keeps them in is of more general use.
There can be a mix of types of top-level bracket/parenthesis pairs in the same input file.
Beware: There are * and $ characters in the input files, so please avoid confusing bash ;-)
I prefer readability over brevity; but not at an exponential cost of speed.
Nice-to-haves:
There are backslash-escaped double-quotes inside the text; preferably they should be handled
(I have a hack, but it's not pretty).
The script oughtn't break over mismatched pairs of brackets/parentheses in junk and/or payload (note: inside the atomics they must be allowed!)
More-far-out-nice-to-haves:
I haven't seen it yet, but one could speculate that some input might have single-quotes rather than double-quotes to denote atomic content... or even a mix of both.
It would be nice if the script could be easily modified to parse input of similar structure but with different start/end characters or strings.
I can see this is quite a mouthful, but I think it wouldn't give a robust solution if I broke it down into simpler questions.
The main problem is splitting up the input correctly - everything else can be ignored or "solved" with hacks, so
feel free to ignore the nice-to-haves and the more-far-out-nice-to-haves.
Given:
$ cat file
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
} trailing junk
intermittent junk
{
payload that goes in second output file }
end junk
This Perl script will extract the blocks you describe into files block_1, block_2, etc:
#!/usr/bin/perl
use v5.10;
use warnings;
use strict;
use Text::Balanced qw(extract_multiple extract_bracketed);
my $txt;
while (<>){$txt.=$_;} # slurp the file
my @blocks = extract_multiple(
$txt,
[
# Extract {...}
sub { extract_bracketed($_[0], '{}') },
],
# Return all the fields
undef,
# Throw out anything which does not match
1
);
chdir "/tmp";
my $base="block_";
my $cnt=1;
for my $block (@blocks){ my $fn="$base$cnt";
say "writing $fn";
open (my $fh, '>', $fn) or die "Could not open file '$fn' $!";
print $fh "$block\n";
close $fh;
$cnt++;}
Now the files:
$ cat block_1
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
$ cat block_2
{
payload that goes in second output file }
Using Text::Balanced is robust and likely the best solution.
You can do the blocks with a single Perl regex:
$ perl -0777 -nlE 'while (/(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++/g) {if ($1) {$cnt++; say "block $cnt:== start:\n$1\n== end";}}' file
block 1:== start:
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
== end
block 2:== start:
{
payload that goes in second output file }
== end
But that is a little more fragile than using a proper parser like Text::Balanced...
I have a solution in C. It would seem there's too much complexity for this to be easily achieved in shell script.
The program isn't overly complicated but nevertheless has more than 200 lines of code, which include error checking, some speed optimization, and other niceties.
Source file split-brackets-to-chunks.c:
#include <stdio.h>
/* Example code by jcxz100 - your problem if you use it! */
#define BUFF_IN_MAX 255
#define BUFF_IN_SIZE (BUFF_IN_MAX+1)
#define OUT_NAME_MAX 31
#define OUT_NAME_SIZE (OUT_NAME_MAX+1)
#define NO_CHAR '\0'
int main()
{
char pcBuff[BUFF_IN_SIZE];
size_t iReadActual;
FILE *pFileIn, *pFileOut;
int iNumberOfOutputFiles;
char pszOutName[OUT_NAME_SIZE];
char cLiteralChar, cAtomicChar, cChunkStartChar, cChunkEndChar;
int iChunkNesting;
char *pcOutputStart;
size_t iOutputLen;
pcBuff[BUFF_IN_MAX] = '\0'; /* ... just to be sure. */
iReadActual = 0;
pFileIn = pFileOut = NULL;
iNumberOfOutputFiles = 0;
pszOutName[OUT_NAME_MAX] = '\0'; /* ... just to be sure. */
cLiteralChar = cAtomicChar = cChunkStartChar = cChunkEndChar = NO_CHAR;
iChunkNesting = 0;
pcOutputStart = (char*)pcBuff;
iOutputLen = 0;
if ((pFileIn = fopen("input-utf-8.txt", "r")) == NULL)
{
printf("What? Where?\n");
return 1;
}
while ((iReadActual = fread(pcBuff, sizeof(char), BUFF_IN_MAX, pFileIn)) > 0)
{
char *pcPivot, *pcStop;
pcBuff[iReadActual] = '\0'; /* ... just to be sure. */
pcPivot = (char*)pcBuff;
pcStop = (char*)pcBuff + iReadActual;
while (pcPivot < pcStop)
{
if (cLiteralChar != NO_CHAR) /* Ignore this char? */
{
/* Yes, ignore this char. */
if (cChunkStartChar != NO_CHAR)
{
/* ... just write it out: */
fprintf(pFileOut, "%c", *pcPivot);
}
pcPivot++;
cLiteralChar = NO_CHAR;
/* End of "Yes, ignore this char." */
}
else if (cAtomicChar != NO_CHAR) /* Are we inside an atomic string? */
{
/* Yup; we are inside an atomic string. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
pcOutputStart = pcPivot;
while (bBreakInnerWhile == 0)
{
if (*pcPivot == '\\') /* Treat next char as literal? */
{
cLiteralChar = '\\'; /* Yes. */
bBreakInnerWhile = 1;
}
else if (*pcPivot == cAtomicChar) /* End of atomic? */
{
cAtomicChar = NO_CHAR; /* Yes. */
bBreakInnerWhile = 1;
}
if (++pcPivot == pcStop) bBreakInnerWhile = 1;
}
if (cChunkStartChar != NO_CHAR)
{
/* The atomic string is part of a chunk. */
iOutputLen = (size_t)(pcPivot-pcOutputStart);
fprintf(pFileOut, "%.*s", (int)iOutputLen, pcOutputStart);
}
/* End of "Yup; we are inside an atomic string." */
}
else if (cChunkStartChar == NO_CHAR) /* Are we inside a chunk? */
{
/* No, we are outside a chunk. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
while (bBreakInnerWhile == 0)
{
/* Detect start of anything interesting: */
switch (*pcPivot)
{
/* Start of atomic? */
case '"':
case '\'':
cAtomicChar = *pcPivot;
bBreakInnerWhile = 1;
break;
/* Start of chunk? */
case '{':
cChunkStartChar = *pcPivot;
cChunkEndChar = '}';
break;
case '[':
cChunkStartChar = *pcPivot;
cChunkEndChar = ']';
break;
case '(':
cChunkStartChar = *pcPivot;
cChunkEndChar = ')';
break;
case '<':
cChunkStartChar = *pcPivot;
cChunkEndChar = '>';
break;
}
if (cChunkStartChar != NO_CHAR)
{
iNumberOfOutputFiles++;
printf("Start '%c' '%c' chunk (file %04d.txt)\n", *pcPivot, cChunkEndChar, iNumberOfOutputFiles);
sprintf((char*)pszOutName, "output/%04d.txt", iNumberOfOutputFiles);
if ((pFileOut = fopen(pszOutName, "w")) == NULL)
{
printf("What? How?\n");
fclose(pFileIn);
return 2;
}
bBreakInnerWhile = 1;
}
else if (++pcPivot == pcStop)
{
bBreakInnerWhile = 1;
}
}
/* End of "No, we are outside a chunk." */
}
else
{
/* Yes, we are inside a chunk. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
pcOutputStart = pcPivot;
while (bBreakInnerWhile == 0)
{
if (*pcPivot == cChunkStartChar)
{
/* Increase level of brackets/parentheses: */
iChunkNesting++;
}
else if (*pcPivot == cChunkEndChar)
{
/* Decrease level of brackets/parentheses: */
iChunkNesting--;
if (iChunkNesting == 0)
{
/* We are now outside chunk. */
bBreakInnerWhile = 1;
}
}
else
{
/* Detect atomic start: */
switch (*pcPivot)
{
case '"':
case '\'':
cAtomicChar = *pcPivot;
bBreakInnerWhile = 1;
break;
}
}
if (++pcPivot == pcStop) bBreakInnerWhile = 1;
}
iOutputLen = (size_t)(pcPivot-pcOutputStart);
fprintf(pFileOut, "%.*s", (int)iOutputLen, pcOutputStart);
if (iChunkNesting == 0)
{
printf("File done.\n");
cChunkStartChar = cChunkEndChar = NO_CHAR;
fclose(pFileOut);
pFileOut = NULL;
}
/* End of "Yes, we are inside a chunk." */
}
}
}
if (cChunkStartChar != NO_CHAR)
{
printf("Chunk exceeds end-of-file. Exiting gracefully.\n");
fclose(pFileOut);
pFileOut = NULL;
}
if (iNumberOfOutputFiles == 0) printf("Nothing to do...\n");
else printf("All done.\n");
fclose(pFileIn);
return 0;
}
I've solved the nice-to-haves and one of the more-far-out-nice-to-haves.
To show this the input is a little more complex than the example in the question:
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
'atomic payload { with start bracket that should be ignored'
nested start bracket { - all of this line is untouchable payload too
here is more payload
"this atomic has a literal double-quote \" inside"
"yet more atomic payload; this one's got a smiley ;-) and a heart <3"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
"here's a totally unprovoked $ sign and an * asterisk"
} trailing junk
intermittent junk
<
payload that goes in second output file } mismatched end bracket should be ignored >
end junk
Resulting file output/0001.txt:
{ here is the actual payload
more payload
'atomic payload { with start bracket that should be ignored'
nested start bracket { - all of this line is untouchable payload too
here is more payload
"this atomic has a literal double-quote \" inside"
"yet more atomic payload; this one's got a smiley ;-) and a heart <3"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
"here's a totally unprovoked $ sign and an * asterisk"
}
... and resulting file output/0002.txt:
<
payload that goes in second output file } mismatched end bracket should be ignored >
Thanks @dawg for your help :)

JFlex match nested comments as one token

In Mathematica a comment starts with (* and ends with *) and comments can be nested. My current approach of scanning a comment with JFlex contains the following code
%xstate IN_COMMENT
"(*" { yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
<IN_COMMENT> {
"(*" {yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
[^\*\)\(]* {return MathematicaElementTypes.COMMENT;}
"*)" {yypopstate(); return MathematicaElementTypes.COMMENT;}
[\*\)\(] {return MathematicaElementTypes.COMMENT;}
. {return MathematicaElementTypes.BAD_CHARACTER;}
}
where the methods yypushstate and yypopstate are defined as
private final LinkedList<Integer> states = new LinkedList<>();
private void yypushstate(int state) {
states.addFirst(yystate());
yybegin(state);
}
private void yypopstate() {
final int state = states.removeFirst();
yybegin(state);
}
to give me the opportunity to track how many nested levels of comment I'm dealing with.
Unfortunately, this results in several COMMENT tokens for one comment, because I have to match nested comment starts and comment ends.
Question: Is it possible with JFlex to use its API with methods like yypushback or advance() etc. to return exactly one token over the whole comment range, even if comments are nested?
It seems the bounty was uncalled for as the solution is so simple that I just did not consider it. Let me explain. When scanning a simple nested comment
(* (*..*) *)
I have to track how many opening comment tokens I have seen so that finally, on the last real closing comment, I can return the whole comment as one token.
What I did not realise was that JFlex does not need to be told to advance to the next portion when it matches something. After careful review I saw that this is explained here, but somewhat hidden in a section I hadn't paid attention to:
Because we do not yet return a value to the parser, our scanner proceeds immediately.
Therefore, a rule in the flex file like this
[^\(\*\)]+ { }
reads all characters except those that could probably be a comment start/end and does nothing but it advances to the next token.
This means that I can simply do the following. In the YYINITIAL state, I have a rule that matches a beginning comment but does nothing other than switch the lexer to the IN_COMMENT state. In particular, it does not return any token:
{CommentStart} { yypushstate(IN_COMMENT);}
Now, we are in the IN_COMMENT state and there, I do the same. I eat up all characters but never return a token. When I hit a new opening comment, I carefully push it onto a stack but do nothing. Only, when I hit the last closing comment, I know I'm leaving the IN_COMMENT state and this is the only point, where I, finally, return the token. Let's look at the rules:
<IN_COMMENT> {
{CommentStart} { yypushstate(IN_COMMENT);}
[^\(\*\)]+ { }
{CommentEnd} { yypopstate();
if(yystate() != IN_COMMENT)
return MathematicaElementTypes.COMMENT_CONTENT;
}
[\*\)\(] { }
. { return MathematicaElementTypes.BAD_CHARACTER; }
}
That's it. Now, no matter how deep your comment is nested, you will always get one single token that contains the entire comment.
Now, I'm embarrassed and I'm sorry for such a simple question.
Final note
If you are doing something like this, you have to remember that you only return a token once you hit the correct closing "character". Therefore, you should definitely add a rule that catches the end of file. In IDEA the default behavior is to just return the comment token, so you need another line (useful or not, I want to end gracefully):
<<EOF>> { yyclearstack(); yybegin(YYINITIAL);
return MathematicaElementTypes.COMMENT;}
When I first wrote this answer I had not even realized that one of the existing answers was by the questioner himself. On the other hand, I seldom find a bounty in the rather small SO lex community, so this seemed worth learning enough Java and JFlex to write a sample:
/* JFlex scanner: to recognize nested comments in Mathematica style
*/
%%
%{
/* counter for open (nested) comments */
int open = 0;
%}
%state IN_COMMENT
%%
/* any state */
"(*" { if (!open++) yybegin(IN_COMMENT); }
"*)" {
if (open) {
if (!--open) {
yybegin(YYINITIAL);
return MathematicaElementTypes.COMMENT;
}
} else {
/* or return MathematicaElementTypes.BAD_CHARACTER;
/* or: throw new Error("'*)' without '(*'!"); */
}
}
<IN_COMMENT> {
. |
\n { }
}
<<EOF>> {
if (open > 0) {
int unclosed = open;
/* Resetting the count and state is obsolete if the scanner is
* instanced anew for each invocation.
*/
open = 0; yybegin(YYINITIAL);
/* Notify about syntax error, e.g. */
throw new Error("Premature end of file! ("
+ unclosed + " open comments not closed.)");
}
return MathematicaElementTypes.EOF; /* just a guess */
}
There might be typos and stupid errors although I tried to be careful and did my best.
As a "proof of concept" I leave my original implementation here which is done with flex and C/C++.
This scanner
handles comments (with printf())
echoes everything else.
My solution is essentially based on the fact that flex actions may end with break or return. Therefore, the token is simply not returned until the pattern closing the outermost comment is matched. Content inside comments is simply "recorded" in a buffer – in my case a std::string.
(AFAIK, string is a built-in type in Java. Therefore, I decided to mix C and C++, which I usually wouldn't.)
My source scan-nested-comments.l:
%{
#include <cstdio>
#include <string>
// counter for open (nested) comments
static int open = 0;
// buffer for collected comments
static std::string comment;
%}
/* make never interactive (prevent usage of certain C functions) */
%option never-interactive
/* force lexer to process 8 bit ASCIIs (unsigned characters) */
%option 8bit
/* prevent usage of yywrap */
%option noyywrap
%s IN_COMMENT
%%
"(*" {
if (!open++) BEGIN(IN_COMMENT);
comment += "(*";
}
"*)" {
if (open) {
comment += "*)";
if (!--open) {
BEGIN(INITIAL);
printf("EMIT TOKEN COMMENT(lexem: '%s')\n", comment.c_str());
comment.clear();
}
} else {
printf("ERROR: '*)' without '(*'!\n");
}
}
<IN_COMMENT>{
. |
"\n" { comment += *yytext; }
}
<<EOF>> {
if (open) {
printf("ERROR: Premature end of file!\n"
"(%d open comments not closed.)\n", open);
return 1;
}
return 0;
}
%%
int main(int argc, char **argv)
{
if (argc > 1) {
yyin = fopen(argv[1], "r");
if (!yyin) {
printf("Cannot open file '%s'!\n", argv[1]);
return 1;
}
} else yyin = stdin;
return yylex();
}
I compiled it with flex and g++ in cygwin on Windows 10 (64 bit):
$ flex -oscan-nested-comments.cc scan-nested-comments.l ; g++ -o scan-nested-comments scan-nested-comments.cc
scan-nested-comments.cc:398:0: warning: "yywrap" redefined
^
scan-nested-comments.cc:74:0: note: this is the location of the previous definition
^
$
The warning appears due to %option noyywrap. I guess it does not mean any harm and can be ignored.
Now, I made some tests:
$ cat >good-text.txt <<EOF
> Test for nested comments.
> (* a comment *)
> (* a (* nested *) comment *)
> No comment.
> (* a
> (* nested
> (* multiline *)
> *)
> comment *)
> End of file.
> EOF
$ cat good-text.txt | ./scan-nested-comments
Test for nested comments.
EMIT TOKEN COMMENT(lexem: '(* a comment *)')
EMIT TOKEN COMMENT(lexem: '(* a (* nested *) comment *)')
No comment.
EMIT TOKEN COMMENT(lexem: '(* a
(* nested
(* multiline *)
*)
comment *)')
End of file.
$ cat >bad-text-1.txt <<EOF
> Test for wrong comment.
> (* a comment *)
> with wrong nesting *)
> End of file.
> EOF
$ cat bad-text-1.txt | ./scan-nested-comments
Test for wrong comment.
EMIT TOKEN COMMENT(lexem: '(* a comment *)')
with wrong nesting ERROR: '*)' without '(*'!
End of file.
$ cat >bad-text-2.txt <<EOF
> Test for wrong comment.
> (* a comment
> which is not closed.
> End of file.
> EOF
$ cat bad-text-2.txt | ./scan-nested-comments
Test for wrong comment.
ERROR: Premature end of file!
(1 open comments not closed.)
$
The Java traditional comment is defined in the sample grammar with
TraditionalComment = "/*" [^*] ~"*/" | "/*" "*"+ "/"
I suppose this expression could be adapted for Mathematica comments too, though note that a single regular expression like this cannot handle arbitrary nesting, which is the crux of the question.

Visual Studio code metrics misreporting lines of code

The code metrics analyser in Visual Studio, as well as the code metrics power tool, report the number of lines of code in the TestMethod method of the following code as 8.
At the most, I would expect it to report lines of code as 3.
[TestClass]
public class UnitTest1
{
private void Test(out string str)
{
str = null;
}
[TestMethod]
public void TestMethod()
{
var mock = new Mock<UnitTest1>();
string str;
mock.Verify(m => m.Test(out str));
}
}
Can anyone explain why this is the case?
Further info
After a little more digging I've found that removing the out parameter from the Test method and updating the test code causes LOC to be reported as 2, which I believe is correct. The addition of out causes the jump, so it's not because of braces or attributes.
Decompiling the DLL with dotPeek reveals a fair amount of additional code generated because of the out parameter, which could be considered 8 LOC. Removing the parameter and decompiling also reveals generated code, which could be considered 5 LOC, so it's not simply a matter of VS counting compiler-generated code (which I don't believe it should do anyway).
There are several common definitions of 'Lines Of Code' (LOC). Each tries to bring some sense to what I think of as an almost meaningless metric. For example, google 'effective lines of code' (eLOC).
I think that VS is including the attribute as part of the method declaration and is trying to give eLOC by counting statements and even braces. One possibility is that 'm => m.Test(out str)' is being counted as a statement.
Consider this:
if (a > 1 &&
b > 2)
{
int result;
result = GetAValue();
return result;
}
and this:
if (a> 1 && b >2)
return GetAValue();
One definition of LOC is to count the lines that have any code. This may even include braces. With such a simplistic definition the count varies hugely with coding style.
eLOC tries to reduce or eliminate the influence of coding style. For example, as may be the case here, a declaration may be counted as a 'line'. Not justifying it, just explaining.
Consider this:
int varA = 0;
varA = GetAValue();
and this:
var varA = GetAValue();
Two lines or one?
It all comes down to what is the intent. If it is to measure how tall a monitor you need then perhaps use a simple LOC. If the intent is to measure complexity then perhaps counting code statements is better such as eLOC.
If you want to measure complexity then use a complexity metric like cyclomatic complexity. Don't worry about how VS is measuring LOC as, I think, it is a useless metric anyway.
With the tool NDepend we get a # Lines of Code (LoC) of 2 for TestMethod(). (Disclaimer: I am one of the developers of this tool.) I wrote an article, How do you count your number of Lines Of Code (LOC)?, that sheds light on what a logical LoC is, and on how all .NET LoC counting tools rely on the PDB sequence points technology.
My guess concerning the LoC value of 8 provided by the VS metric is that it includes the LoC of the method generated for the lambda expression, plus the PDB sequence points related to opening/closing braces (which NDepend doesn't count). Also, a lot of gymnastics is done by the compiler to do what is called capturing the local variable str, but this shouldn't impact the #LoC inferred from the PDB sequence points.
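To illustrate the capture: because the lambda writes to the local str, the compiler hoists str into a generated closure class, and the expression passed to Verify references that field rather than a true local. A simplified, hypothetical sketch (real generated names look like <>c__DisplayClass0, and dotPeek shows considerably more plumbing):
// Hypothetical simplification of the compiler-generated closure.
class DisplayClass
{
    public string str; // the hoisted local from TestMethod
}

// TestMethod then roughly becomes:
// var closure = new DisplayClass();
// mock.Verify(m => m.Test(out closure.str));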
Btw, I wrote two other related LoC articles:
Why is it useful to count the number of Lines Of Code (LOC) ?
Mythical man month : 10 lines per developer day
I was wondering about the Visual Studio line counting and why what I was seeing wasn't what was being reported. So I wrote a small C# console program to count pure lines of code and write the results to a CSV file (see below).
Open a new solution, copy and paste it into the Program.cs file, build the executable, and then you're ready to go. It's a .Net 3.5 application. Copy it into the topmost directory of your code base. Open a command window and run the executable. You get two prompts, first for name of the program/subsystem, and for any extra file types you want to analyze. It then writes the results to a CSV file in the current directory. Nice simple thing for your purposes or to hand to management.
Anyhoo, here it is, FWIW, and YMMV:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
namespace CodeMetricsConsole
{
class Program
{
// Concept here is that the program has a list of file extensions to do line counts on; it
// gets any extra extensions at startup from the user. Then it gets a list of files based on
// each extension in the current directory and all subdirectories. Then it walks through
// each file line by line and will display counts for that file and for that file extension.
// It writes that information to a CSV file in the current directory. It uses regular expressions
// on each line of each file to figure out what it's looking at, and how to count it (i.e. is it
// a line of code, a single or multi line comment, a multi-line string, or a whitespace line).
//
static void Main(string[] args)
{
try
{
Console.WriteLine(); // spacing
// prompt user for subsystem or application name
String userInput_subSystemName;
Console.Write("Enter the name of this application or subsystem (required): ");
userInput_subSystemName = Console.ReadLine();
if (userInput_subSystemName.Length == 0)
{
Console.WriteLine("Application or subsystem name required, exiting.");
return;
}
Console.WriteLine(); // spacing
// prompt user for additional types
String userInput_additionalFileTypes;
Console.WriteLine("Default extensions are asax, css, cs, js, aspx, ascx, master, txt, jsp, java, php, bas");
Console.WriteLine("Enter a comma-separated list of additional file extensions (if any) you wish to analyze");
Console.Write(" --> ");
userInput_additionalFileTypes = Console.ReadLine();
// tell user processing is starting
Console.WriteLine();
Console.WriteLine("Getting LOC counts...");
Console.WriteLine();
// the default file types to analyze - hashset to avoid duplicates if the user supplies extensions
HashSet<String> allowedExtensions = new HashSet<String> { "asax", "css", "cs", "js", "aspx", "ascx", "master", "txt", "jsp", "java", "php", "bas" };
// Add user-supplied types to allowedExtensions if any
String[] additionalFileTypes;
String[] separator = { "," };
if (userInput_additionalFileTypes.Length > 0)
{
// split string into array of additional file types
additionalFileTypes = userInput_additionalFileTypes.Split(separator, StringSplitOptions.RemoveEmptyEntries);
// walk through user-provided file types and append to default file types
foreach (String ext in additionalFileTypes)
{
try
{
allowedExtensions.Add(ext.Trim()); // remove spaces
}
catch (Exception e)
{
Console.WriteLine("Exception: " + e.Message);
}
}
}
// summary file to write to
String summaryFile = userInput_subSystemName + "_Summary.csv";
String path = Directory.GetCurrentDirectory();
String pathAndFile = path + Path.DirectorySeparatorChar + summaryFile;
// regexes for the different line possibilities
Regex oneLineComment = new Regex(@"^\s*//"); // match whitespace to two slashes
Regex startBlockComment = new Regex(@"^\s*/\*.*"); // match whitespace to /*
Regex whiteSpaceOnly = new Regex(@"^\s*$"); // match whitespace only
Regex code = new Regex(@"\S*"); // match anything but whitespace
Regex endBlockComment = new Regex(@".*\*/"); // match anything and */ - only used after block comment detected
Regex oneLineBlockComment = new Regex(@"^\s*/\*.*\*/.*"); // match whitespace to /* ... */
Regex multiLineStringStart = new Regex("^[^\"]*@\".*"); // match @" - don't match "@"
Regex multiLineStringEnd = new Regex("^.*\".*"); // match double quotes - only used after multi line string start detected
Regex oneLineMLString = new Regex("^.*@\".*\""); // match @"..."
Regex vbaComment = new Regex(@"^\s*'"); // match whitespace to single quote
// Uncomment these two lines to test your regex with the function testRegex() below
//new Program().testRegex(oneLineMLString);
//return;
FileStream fs = null;
String line = null;
int codeLineCount = 0;
int commentLineCount = 0;
int wsLineCount = 0;
int multiLineStringCount = 0;
int fileCodeLineCount = 0;
int fileCommentLineCount = 0;
int fileWsLineCount = 0;
int fileMultiLineStringCount = 0;
Boolean inBlockComment = false;
Boolean inMultiLineString = false;
try
{
// write to summary CSV file, overwrite if exists, don't append
using (StreamWriter outFile = new StreamWriter(pathAndFile, false))
{
// outFile header
outFile.WriteLine("filename, codeLineCount, commentLineCount, wsLineCount, mlsLineCount");
// walk through files with specified extensions
foreach (String allowed_extension in allowedExtensions)
{
String extension = "*." + allowed_extension;
// reset accumulating values for extension
codeLineCount = 0;
commentLineCount = 0;
wsLineCount = 0;
multiLineStringCount = 0;
// Get all files in current directory and subdirectories with specified extension
String[] fileList = Directory.GetFiles(Directory.GetCurrentDirectory(), extension, SearchOption.AllDirectories);
// walk through all files of this type
for (int i = 0; i < fileList.Length; i++)
{
// reset values for this file
fileCodeLineCount = 0;
fileCommentLineCount = 0;
fileWsLineCount = 0;
fileMultiLineStringCount = 0;
inBlockComment = false;
inMultiLineString = false;
try
{
// open file
fs = new FileStream(fileList[i], FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (TextReader tr = new StreamReader(fs))
{
// walk through lines in file
while ((line = tr.ReadLine()) != null)
{
if (inBlockComment)
{
if (whiteSpaceOnly.IsMatch(line))
{
fileWsLineCount++;
}
else
{
fileCommentLineCount++;
}
if (endBlockComment.IsMatch(line)) inBlockComment = false;
}
else if (inMultiLineString)
{
fileMultiLineStringCount++;
if (multiLineStringEnd.IsMatch(line)) inMultiLineString = false;
}
else
{
// not in a block comment or multi-line string
if (oneLineComment.IsMatch(line))
{
fileCommentLineCount++;
}
else if (oneLineBlockComment.IsMatch(line))
{
fileCommentLineCount++;
}
else if ((startBlockComment.IsMatch(line)) && (!(oneLineBlockComment.IsMatch(line))))
{
fileCommentLineCount++;
inBlockComment = true;
}
else if (whiteSpaceOnly.IsMatch(line))
{
fileWsLineCount++;
}
else if (oneLineMLString.IsMatch(line))
{
fileCodeLineCount++;
}
else if ((multiLineStringStart.IsMatch(line)) && (!(oneLineMLString.IsMatch(line))))
{
fileCodeLineCount++;
inMultiLineString = true;
}
else if ((vbaComment.IsMatch(line)) && (allowed_extension.Equals("txt") || allowed_extension.Equals("bas")))
{
fileCommentLineCount++;
}
else
{
// none of the above, thus it is a code line
fileCodeLineCount++;
}
}
} // while
outFile.WriteLine(fileList[i] + ", " + fileCodeLineCount + ", " + fileCommentLineCount + ", " + fileWsLineCount + ", " + fileMultiLineStringCount);
fs.Close();
fs = null;
} // using
}
finally
{
if (fs != null) fs.Dispose();
}
// update accumulating values
codeLineCount = codeLineCount + fileCodeLineCount;
commentLineCount = commentLineCount + fileCommentLineCount;
wsLineCount = wsLineCount + fileWsLineCount;
multiLineStringCount = multiLineStringCount + fileMultiLineStringCount;
} // for (specific file)
outFile.WriteLine("Summary for: " + extension + ", " + codeLineCount + ", " + commentLineCount + ", " + wsLineCount + ", " + multiLineStringCount);
} // foreach (all files with specified extension)
} // using summary file streamwriter
Console.WriteLine("Analysis complete, file is: " + pathAndFile);
} // try block
catch (Exception e)
{
Console.WriteLine("Error: " + e.Message);
}
}
catch (Exception e2)
{
Console.WriteLine("Error: " + e2.Message);
}
} // main
// local testing function for debugging purposes
private void testRegex(Regex rx)
{
String test = " asdfasd asdf @\" adf ++--// /*\" ";
if (rx.IsMatch(test))
{
Console.WriteLine(" -->| " + rx.ToString() + " | matched: " + test);
}
else
{
Console.WriteLine("No match");
}
}
} // class
} // namespace
Here's how it works:
the program has a set of the file extensions you want to analyze.
It walks through each extension in the set, getting all files of that type in the current and all subdirectories.
It selects each file, goes through each line of that file, compares each line to the regexes to figure out what it's looking at, and increments the appropriate line count.
If a line isn't whitespace, a single or multi-line comment, or a multi-line string, it counts it as a line of code. It reports all the counts for each of those types of lines (code, comments, whitespace, multi-line strings) and writes them to a CSV file. No need to explain why Visual Studio did or did not count something as a line of code.
Yes, there are three loops embedded in each other (O(n-cubed) O_O ) but it's just a simple, standalone developer tool, and the biggest code base I've run it on was about 350K lines and it took like 10 seconds to run on a Core i7.
Edit: Just ran it on the Firefox 12 code base, about 4.3 million lines (3.3M code, 1M comments), about 21K files, with an AMD Phenom processor - took 7 minutes, watched the performance tab in Task Manager, no stress. FYI.
My attitude is if I wrote it to be part of an instruction fed to a compiler, it's a line of code and should be counted.
It can easily be customized to ignore or count whatever you want (brackets, namespaces, the includes at the top of the file, etc). Just add the regex, test it with the function that's right there below the regexes, then update the if statement with that regex.
