How do I write a program in lex (or flex) that removes nested comments from text and prints only the text that is not inside comments?
I should probably track a state for when I am inside a comment, along with the number of opening "tags" of block comments seen so far.
Let's have these rules:
1. Block comment
/*
block comment
*/
2. line comment
// line comment
3. Comments can be nested.
Example 1
show /* comment /* comment */ comment */ show
output:
show show
Example 2
show /* // comment
comment
*/
show
output:
show
show
Example 3
show
///* comment
comment
// /*
comment
//*/ comment
//
comment */
show
output:
show
show
You got the theory right. Here's a simple implementation; could be improved.
%x COMMENT
%%
%{
    int comment_nesting = 0;
%}
"/*"            BEGIN(COMMENT); ++comment_nesting;
"//".*          /* // comments to end of line */
<COMMENT>[^*/]* /* Eat non-comment delimiters */
<COMMENT>"/*"   ++comment_nesting;
<COMMENT>"*/"   if (--comment_nesting == 0) BEGIN(INITIAL);
<COMMENT>[*/]   /* Eat a / or * if it doesn't match comment sequence */
    /* Could have been .|\n ECHO, but this is more efficient. */
([^/]*([/][^/*])*)* ECHO;
%%
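For readers who want to experiment with the nesting-counter idea outside of flex, here is a minimal Python sketch of the same logic. It is not a translation of the generated scanner: the function name strip_comments is made up for illustration. Like the rules above, it treats // as a line comment only outside block comments and keeps the newline that ends the line comment.

```python
def strip_comments(text: str) -> str:
    """Remove nested /* ... */ comments and // line comments (sketch)."""
    out = []
    depth = 0          # current block-comment nesting level
    i = 0
    n = len(text)
    while i < n:
        two = text[i:i + 2]
        if two == "/*":
            depth += 1                 # open sign, inside or outside a comment
            i += 2
        elif depth > 0:
            if two == "*/":
                depth -= 1             # close sign ends one nesting level
                i += 2
            else:
                i += 1                 # eat any other character in a comment
        elif two == "//":
            j = text.find("\n", i)     # skip to end of line, keep the newline
            i = n if j == -1 else j
        else:
            out.append(text[i])        # ordinary text is echoed
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_comments("show /* comment /* comment */ comment */ show"))
```

Note that the whitespace on both sides of a removed comment survives, so Example 1 comes out with two spaces between the two words.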
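This is exactly what you need: yy_push_state(COMMENT). It uses a stack to store the start states, which comes in handy for nested constructs. (In flex, the start-condition stack has to be enabled with %option stack, and yy_pop_state() returns to the previous state.)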
I am afraid that @rici's answer might be wrong. First, we need to record the line number, since the file's line directive might change later. Second, given an open sign and a close sign (open and close below), we have the following principles:
1) Use an integer for stack control: push for an open sign, pop for a close sign.
2) Eat up every character before EOF, and every close sign without an open sign inside:
<comments>{open} {no_open_sign++;}
<comments>\n {curr_lineno++;}
<comments>. /* eat any other character by doing nothing */
3) Errors might happen when no_open_sign drops to zero, hence
<comments>{close} similar as above post
4) EOF should not be inside the string, hence you need a rule
<comments><<EOF>> {return ERROR_TOKEN;}
To make it more robust, you also need to have another close checking rule outside of
And in practice, you should use negative lookahead and lookbehind regular expression syntax if your lexical analyzer supports it.
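The principles above (a counter for open signs, line counting inside comments, and an error at EOF) can be sketched in Python as follows. This is only an illustration under stated assumptions: scan_comments and CommentError are hypothetical names, and rejecting a close sign that has no matching open sign is my reading of the "close checking rule", not something the answer spells out.

```python
class CommentError(Exception):
    """Raised for an unbalanced open or close sign (hypothetical name)."""


def scan_comments(text, open_sign="/*", close_sign="*/"):
    """Return (text outside comments, number of the last line seen)."""
    depth = 0        # plays the role of no_open_sign
    lineno = 1       # plays the role of curr_lineno
    kept = []
    i = 0
    while i < len(text):
        if text.startswith(open_sign, i):
            depth += 1
            i += len(open_sign)
        elif text.startswith(close_sign, i):
            if depth == 0:
                # Assumption: a stray close sign outside a comment is an error.
                raise CommentError(f"line {lineno}: close sign without open sign")
            depth -= 1
            i += len(close_sign)
        else:
            if text[i] == "\n":
                lineno += 1            # count lines inside comments too
            if depth == 0:
                kept.append(text[i])
            i += 1
    if depth > 0:
        # Corresponds to the EOF rule returning ERROR_TOKEN.
        raise CommentError(f"line {lineno}: EOF inside comment")
    return "".join(kept), lineno
```

Counting newlines in the same loop that eats comment characters keeps the line number correct even when a comment spans several lines.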
I wrote a working sed script which replaces multiple spaces with a single space between tokens (it skips lines with # or //):
#!/bin/sed -f
/.*#/ !{
/\/\//n
# handle more than one space between tokens
s/\([^ ]\)\s\+/\1 /g
}
I run it on Ubuntu like this: ./spaces.sed < spa.txt
spa.txt:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
Now I want the script to also skip over the lines between a pair of patterns (which can be spread across multiple lines). The patterns are /* and */.
I tried many things, but to no avail:
#!/bin/sed -f
/.*#/ !{
/\/\*/,/\*\// {
/\/\*/n #it skips successfully the /* line
n #also skips next line
/\*\// !{
}
}
/\/\//n
# handle more than one space between tokens
s/\([^ ]\)\s\+/\1 /g
}
But the script isn't working as expected.
Expected output:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
Suggestions?
Thanks
I'd re-engineer the script a bit, to handle # and // comments on their own. With the /* … */ comments, you have to deal with single-line and multi-line variants separately. I'd also use the [[:space:]] notation to spot spaces or tabs. I prefer to avoid backslashes (an aversion caused by working with troff in the days of my youth; if you've never needed 16 backslashes in a row to get the desired effect, you've not suffered enough), so I use \%…% to choose the % character as the search marker instead of / (which means there's no need to escape the slashes in the pattern with a backslash), and I use [*] instead of \*. The { p; d; } notation prints the current line and then deletes it and moves on to the next line. (Using n prints the current line and reads the next one into the pattern space, but then carries on with the rest of the script instead of starting a new cycle; it isn't what you need.) The second semicolon isn't required by GNU sed but is by BSD (macOS) sed. The spaces in those braces are optional but make it easier to read.
Putting this together, you might have spaces.sed like this:
#!/bin/sed -f
# Comments with a #
/#/ { p; d; }
# Comments with //
\%//% { p; d; }
# Single line /* ... */ comments
\%/[*].*[*]/% { p; d; }
# Multi-line /* ... */ comments
\%/[*]%,\%[*]/% { p; d; }
s/\([^[:space:]]\)[[:space:]]\{2,\}/\1 /g
On your sample data (thanks for including it!), this produces:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
That looks like what you wanted.
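If a sed script ever becomes too awkward, the same line-oriented approach is easy to sketch in Python. The version below only approximates the script above; the function name squeeze_spaces is made up for illustration, and it expects lines without their trailing newline (for example from str.splitlines()).

```python
import re

# A non-space character followed by two or more whitespace characters.
MULTI_SPACE = re.compile(r"(\S)\s{2,}")


def squeeze_spaces(lines):
    """Collapse runs of spaces between tokens, leaving comment lines alone."""
    out = []
    in_block = False                    # inside a multi-line /* ... */ comment
    for line in lines:
        if in_block:
            out.append(line)            # copy comment lines unchanged
            if "*/" in line:
                in_block = False
        elif "#" in line or "//" in line or re.search(r"/\*.*\*/", line):
            out.append(line)            # #, // and single-line /* */ comments
        elif "/*" in line:
            in_block = True             # start of a multi-line comment
            out.append(line)
        else:
            out.append(MULTI_SPACE.sub(r"\1 ", line))
    return out
```

Because the pattern requires a non-space character before the run of whitespace, leading blanks are left untouched, just as with the sed version.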
Limitations
It doesn't remove multiple spaces at the start of a line.
the leading blanks are not removed.
If you have a line with multiple spaces and // or #, the multiple spaces remain:
these spaces // survive
so do # these
If you have multiple single line comments on a single line, you don't get spaces removed in between them:
/* these */ spaces are not /* removed */
If you have a single-line comment and the start of a multi-line comment on a single line, the multi-line comment is not spotted. Similarly, if you have a multi-line comment that ends on a line and has a single-line comment starting after it, then if there are any multiple spaces between the end of the one comment and the start of the next, they are not handled.
/* this */ is not /* handled
very well */ nor are these /* spaces */
This doesn't deal with the subtleties of backslash-newline in the middle of a start or end comment symbol, nor with backslash-newline at the end of a // comment. Only brain-dead programs (or programmers) produce such comments, so it shouldn't be a real problem. Fortunately, you're not writing a compiler; those have to deal with the nonsense. And don't get me started on trigraphs!
It doesn't handle comment-like sequences inside strings (or multi-character character constants):
"/* this is not a comment */"
'/*', ' ', '*/'
However, most of these issues are subtle enough that you're probably OK without dealing with them. If you must deal with them, then you need a program, not a sed script (assuming you value your sanity).
I'm working on a lexer for Ruby. Such a lexer needs to clearly
distinguish divide '/' operators from regex /..../ operands.
Lexers are nicest to build when they are context free (stateless)
with respect to lexing-the-next token.
Some program text that starts with "/" might be:
... / abc*(foo(def,bar[q-z]*)+sam) / ...
You can't tell if the '/' symbol is a divide or the start of regexp.
So clearly Ruby must be looking at the context, or it must have a rule
to decide when it is ambiguous. What's the rule?
[one possibility: it only allows them where divide cannot occur, e.g, after
when [ ( , #{ { if elseif != = !~ + , << and or not
(Edit 8/24/2015: extended the above list)
Does that cover everything? Or is it something entirely different?]
The Ruby lexer emits completely different tokens for a division operator and for the start of a regex (one is '/', the other tREGEXP_BEG). So the parser has no idea that the two actually use the same source text.
How does the lexer know which token to emit? See parse.y:8451 from the Ruby source.
The parser_params struct which is passed to the lexer has a member called lex.state. This is a bitfield, with each bit indicating something about the lexer state. The individual bits are called BEG, END, ENDARG, ENDFN, ARG, CMDARG, MID, FNAME, DOT, CLASS, LABEL, and LABELED.
When the lexer sees a '/' character, it emits tREGEXP_BEG if...
The lexer state is true for both ARG and LABELED, or
The lexer state is true for any one of BEG, MID, or CLASS.
Otherwise, it emits a division operator token.
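To make the rule concrete, here is a small Python sketch of the decision as described above. It is not Ruby's implementation: LexState and slash_token are hypothetical names, and the bit values assigned by auto() are arbitrary rather than the ones used in parse.y.

```python
from enum import IntFlag, auto


class LexState(IntFlag):
    """Illustrative stand-in for the lexer state bitfield."""
    BEG = auto()
    END = auto()
    ENDARG = auto()
    ENDFN = auto()
    ARG = auto()
    CMDARG = auto()
    MID = auto()
    FNAME = auto()
    DOT = auto()
    CLASS = auto()
    LABEL = auto()
    LABELED = auto()


def slash_token(state: LexState) -> str:
    """Decide which token a '/' produces in the given lexer state."""
    beg_any = LexState.BEG | LexState.MID | LexState.CLASS
    arg_labeled = LexState.ARG | LexState.LABELED
    # Regex start if any of BEG/MID/CLASS is set, or both ARG and LABELED are.
    if state & beg_any or (state & arg_labeled) == arg_labeled:
        return "tREGEXP_BEG"
    return "'/'"
```

For example, slash_token(LexState.BEG) yields the regex-start token, while slash_token(LexState.END) yields the division operator.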
So what do the states actually mean? The Ruby source contains the following comments on them:
EXPR_BEG_bit, /* ignore newline, +/- is a sign. */
EXPR_END_bit, /* newline significant, +/- is an operator. */
EXPR_ENDARG_bit, /* ditto, and unbound braces. */
EXPR_ENDFN_bit, /* ditto, and unbound braces. */
EXPR_ARG_bit, /* newline significant, +/- is an operator. */
EXPR_CMDARG_bit, /* newline significant, +/- is an operator. */
EXPR_MID_bit, /* newline significant, +/- is an operator. */
EXPR_FNAME_bit, /* ignore newline, no reserved words. */
EXPR_DOT_bit, /* right after `.' or `::', no reserved words. */
EXPR_CLASS_bit, /* immediate after `class', no here document. */
EXPR_LABEL_bit, /* flag bit, label is allowed. */
EXPR_LABELED_bit, /* flag bit, just after a label. */
Whenever the lexer emits a token, depending on the current lexer state, the token which was lexed, and possibly what the lexer sees next in the source text (it does look ahead in a number of places), it may move to a new state.
Some of the states are only entered after lexing a reserved keyword. For example, EXPR_MID is entered after lexing break, next, rescue, or return.
This is because of the way the parser is defined. If you look at the BNF definition of Ruby, you can see that the division operation (in the ARGS section) is defined before the definition of a REGEXP. That's why the division operation has a higher precedence than a regexp.
Meaning, if the Ruby parser stumbles upon a section that resolves to
ARG / ARG
it will treat it as a division and move on.
Walking through a flex/bison tutorial will enlighten you! (Plus, it is fun.)
So the title might be a little bit misleading, but I can't think of any better way to phrase it.
Basically, I'm writing a lexical scanner using cygwin/lex. Part of the code reads a token /*. It then goes into a predefined state C_COMMENT, which ends when <C_COMMENT>"*/" is matched. Below is the actual code:
"/*" {BEGIN(C_COMMENT); printf("%d: /*", linenum++);}
<C_COMMENT>"*/" { BEGIN(INITIAL); printf("*/\n"); }
<C_COMMENT>. {printf("%s",yytext);}
The code works when the comment is in a single line, such as
/* * Example of comment */
It will print the current line number, with the comment behind. But it doesn't work if the comment spans multiple lines. Rewriting the 3rd line into
<C_COMMENT>. {printf("%s",yytext);
printf("\n");}
doesn't work. It will result in \n printed for every letter in the comment. I'm guessing it has something to do with C having no strings or maybe I'm using the states wrong.
Hope someone will be able to help me out :)
Also if there's any other info you need, just ask, and I'll provide.
The easiest way to echo the token scanned by a pattern is to use the special action ECHO:
"/*" { printf("%d: ", linenum++); ECHO; BEGIN(C_COMMENT); }
<C_COMMENT>"*/" { ECHO; BEGIN(INITIAL); }
<C_COMMENT>. { ECHO; }
None of the above rules matches a newline inside a comment, because in (f)lex . doesn't match newlines, so you also need:
<C_COMMENT>\n { linenum++; ECHO; }
A faster way of recognizing C comments is with a single regular expression, although it's a little hard to read:
[/][*][^*]*[*]+([^/*][^*]*[*]+)*[/]
In this case, you'll have to rescan the comment to count newlines, unless you get flex to do the line number counting.
flex scanners maintain a line number count in yylineno, if you request that feature (using %option yylineno). It's often more efficient and always more reliable than keeping the count yourself. However, in the action, the value of yylineno is the line number count at the end of the pattern, not at the beginning, which can be misleading for multiline patterns. A common workaround is to save the value of yylineno in another variable at the beginning of the token scan.
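The idea of remembering the line on which a multi-line token starts can be tried out in Python, whose re module accepts essentially the same single-expression comment pattern. This is a sketch only; find_comments is a hypothetical helper, and the running line count stands in for what yylineno would provide in a flex scanner.

```python
import re

# Single-expression C comment: /* ... */ without nesting.
C_COMMENT = re.compile(r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/")


def find_comments(text):
    """Return a list of (start_line, comment_text) pairs."""
    found = []
    lineno = 1          # line number at position `pos`
    pos = 0
    for match in C_COMMENT.finditer(text):
        # Count newlines between the previous position and the comment start,
        # so the saved number is the line where the comment begins.
        lineno += text.count("\n", pos, match.start())
        found.append((lineno, match.group()))
        # Rescan the comment itself to advance past the lines it spans.
        lineno += match.group().count("\n")
        pos = match.end()
    return found
```

Saving the line number before adding the newlines inside the match is the same workaround as copying yylineno at the beginning of the token scan.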
I have the following task, but I have no idea how to do it with sed.
I have to change the commenting style in a file from:
//something6
//something4
//something5
//something3
//something2
to
/*something6
* something4
* something5
* something3
* something2*/
from
//something6
//something4
//something5
//something3
//something2
to
/*something6
something4
something5
something3
something2*/
from
/*something6
* something4
* something5
* something3
* something2*/
to
//something6
//something4
//something5
//something3
//something2
from
/*something6
something4
something5
something3
something2*/
to
//something6
//something4
//something5
//something3
//something2
Those 4 conversions must be done with sed (I guess, but I'm not sure about that).
I tried doing it, but without luck. I can replace single words with other ones, but how do I change the commenting style? No clue. I would be very grateful for help and assistance.
Given that the task is:
Please write a script that allows to change style of comments in source files for example : /* .... */ goes to // .... The style of comment is an argument of the script.
I have tried to use just typical:
sed -i 's/'"$lookingfor"'/'"$changing"'/g' $filename
In this context, either $lookingfor or $changing or both will contain slashes, so that simple formulation doesn't work, as you correctly observe.
The conversion of // comments to /* comments is easy as long as you know that you can choose an arbitrary character to separate the sections of the s/// command, such as %. So, for example, you could use:
sed -i.bak -e 's%// *\(.*\)%/*\1 */%'
This looks for a double-slash followed by zero or more spaces and anything and converts it to /* anything */.
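The same substitution can be tried in Python with re.sub, which makes it easy to check the result on a few lines before running sed over real files. The function name line_to_block_comment is made up for illustration, and count=1 mirrors the fact that the sed command has no g flag.

```python
import re


def line_to_block_comment(line: str) -> str:
    """Rewrite the first // comment on a line as /* ... */ (sketch)."""
    # "// *" eats the slashes and any spaces after them; (.*) is the comment.
    return re.sub(r"// *(.*)", r"/*\1 */", line, count=1)


if __name__ == "__main__":
    print(line_to_block_comment("//something6"))
```

Lines without a // are returned unchanged, since the pattern simply does not match.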
The conversion of /* comments is much harder. There are two cases to be concerned about:
/* A single line comment */
/*
** A multiline comment
*/
That's before you get into:
/* OK */ "/* OK */" /* Really?! */
which is a single line containing two comments and a string containing text that outside a string would look like a comment. This I am studiously ignoring! Or, more accurately, I am studiously deciding that it will be OK when converted to:
// OK */ "/* OK */" /* Really?!
which isn't the same at all, but serves you right for writing convoluted C in the first place.
You can deal with the first case with something like:
sed -e '\%/\*\(.*\)\*/% { s%%//\1%; n; }'
I have the grouping braces and the n command in there so that single line comments don't also match the second case:
-e '\%/\*%,\%\*/% {
\%/\*% { s%/\*\(.*\)%//\1%; n; }
\%\*/% { s%\(.*\)\*/%//\1%; n; }
s%^\( *\)%\1//%
}'
The first line selects a range of lines between one matching /* and the next matching */. The \% tells sed to use the % instead of / as the search delimiter. There are three operations within the outer grouping { … }:
Convert /*anything into //anything and start on the next line.
Convert anything*/ into //anything and start on the next line.
Convert any other line so that it preserves leading blanks but puts // after them.
This is still ridiculously easy to subvert if the comments are maliciously formed. For example:
/* a comment */ int x = 0;
is mapped to:
// a comment int x = 0;
Fixing problems like that, and the example with a string, is something I'd not even start trying in sed. And that's before you get onto the legal but implausible C comments, like:
/\
\
* comment
*\
\
/
/\
/\
noisiness \
commentary \
continued
Which contains just two comments (but does contain two comments!). And before you decide to deal with trigraphs (??/ is a backslash). Etc.
So, a moderate approximation to a C to C++ comment conversion is:
sed -e '\%/\*\(.*\)\*/% { s%%//\1%; n; }' \
-e '\%/\*%,\%\*/% {
\%/\*% { s%/\*\(.*\)%//\1%; n; }
\%\*/% { s%\(.*\)\*/%//\1%; n; }
s%^\( *\)%\1//%
}' \
-i.bak "$@"
I'm assuming you aren't using a C shell; if you are, you need more backslashes at the ends of the lines in the script so that the multi-line single-quoted sed command is treated correctly.
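The three operations described above (first line of a block, last line of a block, lines in between) can also be sketched in Python, which may be easier to extend than the sed script. This is an approximation with the same limitations, not a port: block_to_line_comments is a hypothetical name, and it expects lines without trailing newlines.

```python
import re


def block_to_line_comments(lines):
    """Convert /* ... */ comments to // comments, line by line (sketch)."""
    out = []
    in_block = False                     # inside a multi-line comment
    for line in lines:
        if not in_block:
            # Case 1: the comment opens and closes on the same line.
            single = re.sub(r"/\*(.*)\*/", r"//\1", line, count=1)
            if single != line:
                out.append(single)
            elif "/*" in line:
                # Case 2, first line: /*anything becomes //anything.
                in_block = True
                out.append(re.sub(r"/\*(.*)", r"//\1", line, count=1))
            else:
                out.append(line)
        elif "*/" in line:
            # Case 2, last line: anything*/ becomes //anything.
            in_block = False
            out.append(re.sub(r"(.*)\*/", r"//\1", line, count=1))
        else:
            # Case 2, middle lines: keep leading blanks, then add //.
            out.append(re.sub(r"^( *)", r"\1//", line, count=1))
    return out
```

As with the sed version, code that follows a comment on the same line ends up inside the // comment, so this is only suitable for well-behaved input.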
How do I change the comments to look like /* */ instead of // in VS 2008?
// this is a line comment, it will only comment this line
// for the next line you need to repeat //
/* this is a block comment
you can do all sort of stuff here
and you won't have to worry about beginning the line with some special chars
until the end*/
Since those two types of comments are a bit different, I would say that you should use both of them. It's not an error to have line and block comments in the same file.
I suppose you could run a regexp replace that replaces // at the beginning of the line with /* and adds */ at the end, but you will end up with something like this:
/* first line comment */
/* second line comment */
/* third line comment */
/* fourth line comment */