ANTLR3 not ignoring comments that begin at the first character of a file - comments

Sorry if any terminology is off; I just started using ANTLR recently.
Here's the ANTLR grammar that is supposed to ignore multi-line comments:
COMMENT : '/*' .* '*/';
SPACE : (' ' | '\t' | '\r' | '\n' | COMMENT)+ {$channel = HIDDEN;} ;
Here's a comment beginning at the first character of a file I'd like to compile:
/*
This is a comment
*/
Here's the error I get:
[filename] line 252:0 no viable alternative at character '<EOF>'
[filename] line 1:1 no viable alternative at input '*'
However, if I put a space in front of the comment, like so:
 /*
This is a comment
*/
It compiles fine. Any ideas?

To ignore multi-line comments:
ML_COMMENT
: '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
EDIT:
Maybe the problem is not in your lexer but in your parser. With $channel=HIDDEN, the lexer keeps all of these tokens away from the parser. That would explain why the parser sees EOF first: it receives nothing.
If you put a whitespace character first, the parser receives something and is able to process the input.
That may well be your issue.
I hope this helps!

Related

why does a comma "," get counted in [.] type expression in antlr lexer

I am making a grammar for bash scripts. I am facing a problem while tokenising the "," symbol. The following grammar tokenises it as <BLOB> while I expect it to be tokenised as <OTHER>.
grammar newgram;
code : KEY (BLOB)+ (EOF | '\n')+;
KEY : 'wget';
BLOB : [a-zA-Z0-9#!$^%*&+-.]+?;
OTHER : .;
However, if I change BLOB to [a-zA-Z0-9#!$^%*&+.-]+?;, then the comma is tokenised as <OTHER>.
I cannot understand why this happens.
In the former case, the characters : and / are also tokenised as <OTHER>, so I do not see why , should be marked <BLOB>.
The input I am tokenising: wget -o --quiet https,://www.google.com
The output I get with the grammar above:
[#0,0:3='wget',<'wget'>,1:0]
[#1,4:4=' ',<OTHER>,1:4]
[#2,5:5='-',<BLOB>,1:5]
[#3,6:6='o',<BLOB>,1:6]
[#4,7:7=' ',<OTHER>,1:7]
[#5,8:8='-',<BLOB>,1:8]
[#6,9:9='-',<BLOB>,1:9]
[#7,10:10='q',<BLOB>,1:10]
[#8,11:11='u',<BLOB>,1:11]
[#9,12:12='i',<BLOB>,1:12]
[#10,13:13='e',<BLOB>,1:13]
[#11,14:14='t',<BLOB>,1:14]
[#12,15:15=' ',<OTHER>,1:15]
[#13,16:16='h',<BLOB>,1:16]
[#14,17:17='t',<BLOB>,1:17]
[#15,18:18='t',<BLOB>,1:18]
[#16,19:19='p',<BLOB>,1:19]
[#17,20:20='s',<BLOB>,1:20]
[#18,21:21=',',<BLOB>,1:21]
[#19,22:22=':',<OTHER>,1:22]
[#20,23:23='/',<OTHER>,1:23]
[#21,24:24='/',<OTHER>,1:24]
[#22,25:25='w',<BLOB>,1:25]
[#23,26:26='w',<BLOB>,1:26]
[#24,27:27='w',<BLOB>,1:27]
[#25,28:28='.',<BLOB>,1:28]
[#26,29:29='g',<BLOB>,1:29]
[#27,30:30='o',<BLOB>,1:30]
[#28,31:31='o',<BLOB>,1:31]
[#29,32:32='g',<BLOB>,1:32]
[#30,33:33='l',<BLOB>,1:33]
[#31,34:34='e',<BLOB>,1:34]
[#32,35:35='.',<BLOB>,1:35]
[#33,36:36='c',<BLOB>,1:36]
[#34,37:37='o',<BLOB>,1:37]
[#35,38:38='m',<BLOB>,1:38]
[#36,39:39='\n',<'
'>,1:39]
[#37,40:39='<EOF>',<EOF>,2:0]
line 1:4 extraneous input ' ' expecting BLOB
line 1:7 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:15 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:22 extraneous input ':' expecting {<EOF>, '
', BLOB}
As already mentioned in a comment, the - in +-. inside your character class is interpreted as a range operator, and the , lies inside that range (the characters + , - . are consecutive in ASCII). Escape it like this: [a-zA-Z0-9#!$^%*&+\-.]+?
Also, a trailing [ ... ]+? at the end of a lexer rule will always match a single character. So [a-zA-Z0-9#!$^%*&+\-.]+? can just as well be written as [a-zA-Z0-9#!$^%*&+\-.]
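The range effect can be checked outside ANTLR. The following is a minimal sketch using Python's re module, whose character classes treat an unescaped - between two characters the same way; it is only an analogy, not ANTLR's own lexer, and the variable names are made up for illustration.

```python
import re

# Characters covered by the range '+'-'.' (code points 0x2B to 0x2E).
in_range = [chr(c) for c in range(ord('+'), ord('.') + 1)]
print(in_range)

# Same character class as BLOB, once with the unescaped '-' and once escaped.
unescaped = re.compile(r'[a-zA-Z0-9#!$^%*&+-.]')
escaped = re.compile(r'[a-zA-Z0-9#!$^%*&+\-.]')

comma_unescaped = bool(unescaped.fullmatch(','))  # ',' falls inside '+'-'.'
comma_escaped = bool(escaped.fullmatch(','))      # '-' is now a literal hyphen
print(comma_unescaped, comma_escaped)
```

The colon and slash fall outside that range, which is consistent with them being tokenised as <OTHER> in both versions of the grammar.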

tr command: strange behavior with | and \

Let's say I have a file test.txt with contents:
+-foo.bar:2.4
| bar.foo:1.1:test
\| hello.goobye:3.3.3
\|+- baz.yeah:4
I want to use the tr command to delete all instances of the following set of characters:
{' ', '+', '-', '|', '\'}
I've done fairly extensive research on this but found no clear, concise answers.
This is the command that works:
input:
cat test.txt | tr -d "[:blank:]|\\\+-"
output:
foo.bar:2.4
bar.foo:1.1:test
hello.goobye:3.3.3
baz.yeah:4
I experimented with many combinations of that set and I found out that the '-' was being treated as a range indicator (like... [a-z]) and therefore must be put at the end. But I have two main questions:
1) Why must the backslash be double escaped in order to be included in the set?
2) Why does putting the '|' at the end of the set string cause the tr program to delete everything in the file except for trailing new line characters?
Like this:
tr -d '\-|\\+[:blank:] ' < file
You have to escape the - because it is used for denoting ranges of characters like:
tr -d '1-5'
and must therefore be escaped if you mean a literal hyphen. You can also put it at the end of the set. (Learned that, thanks! :) )
Furthermore, the \ must be escaped when you mean a literal \, because tr uses it to introduce escape sequences.
The remaining characters need not be escaped.
Why must the \ be doubly escaped in your example?
It's because you are using a "" (double-quoted) string to quote the character set. The shell interprets a double-quoted string before tr sees it, and a \\ inside double quotes becomes a single literal \. Try:
echo "\+"
echo "\\+"
echo "\\\+"
To avoid escaping the \ twice, you can use single quotes, as in my example above.
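The double-quote behaviour can also be approximated without a shell. This sketch uses Python's shlex module in its default POSIX mode, which follows similar rules for backslashes inside double quotes; it is an approximation (a real shell additionally treats $ and ` specially), and tr_set is just an illustrative name.

```python
import shlex

# What the three echo arguments look like after quote removal.
for quoted in (r'"\+"', r'"\\+"', r'"\\\+"'):
    print(quoted, '->', shlex.split(quoted)[0])

# The set that tr actually receives from the working command in the question.
tr_set = shlex.split(r'tr -d "[:blank:]|\\\+-"')[2]
print(tr_set)
```

Under these rules tr receives two backslashes followed by +-, that is, an escaped literal backslash, a plus, and a hyphen in final position.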
Why does putting the '|' at the end of the set string cause the tr program to delete everything in the file except for trailing new line characters?
As CharlesDuffy's comment points out, having the | at the end also means that the unescaped - was no longer at the end, so it described a range of characters; the actual range depends on where the - sat in the set.
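To see why such a range can wipe out nearly everything, here is a small Python sketch. It assumes, hypothetically, that the - ended up between + and |, forming the range +-|; the exact set that was tried is not shown in the question.

```python
text = ("+-foo.bar:2.4\n"
        "| bar.foo:1.1:test\n"
        "\\| hello.goobye:3.3.3\n"
        "\\|+- baz.yeah:4\n")

# Assumed range '+'-'|' (code points 0x2B to 0x7C): it spans digits,
# letters, '.', ':' and the backslash.
covered = {chr(c) for c in range(ord('+'), ord('|') + 1)}
covered |= {' ', '\t'}  # stands in for [:blank:]

survivors = ''.join(ch for ch in text if ch not in covered)
print(repr(survivors))
```

Only the newline characters lie outside that range, which matches the behaviour described in the question.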
Another approach is to define the allowed characters:
$ tr -cd '[:alnum:]:.\n' <file
foo.bar:2.4
bar.foo:1.1:test
hello.goobye:3.3.3
baz.yeah:4
Or, perhaps, delete the leading non-word characters of each line (\W is a GNU sed extension; the ^ keeps the match at the start of the line):
$ sed -E 's/^\W+//' file
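For comparison, the whitelist idea can be sketched in Python as well. This is an illustration of the same filtering, not a replacement for tr; note that str.isalnum is Unicode-aware, which makes no difference for the ASCII sample from the question.

```python
text = ("+-foo.bar:2.4\n"
        "| bar.foo:1.1:test\n"
        "\\| hello.goobye:3.3.3\n"
        "\\|+- baz.yeah:4\n")

# Keep alphanumerics plus ':', '.' and newline; drop everything else.
allowed_extra = ":.\n"
kept = ''.join(ch for ch in text if ch.isalnum() or ch in allowed_extra)
print(kept, end='')
```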

bash replacing character at certain position on a certain line

I have a file that looks like this:
[
{
"ncyc" : 28817,
"icels" : 128,
"jcels" : 128,
"t" : 0.185896E-006,
"dt" : 0.955602E-012,
"dtcour" : 0.100000E+021,
"dti" : 0.100000E+021,
"dtc" : 0.262902E-011,
"dtvol" : 0.239735E-010,
"dthall" : 0.100000E+021,
"dtlaser" : -0.925596E+062,
"dtmax" : 0.200000E-009,
}
]
I want to delete the last comma of this file. It appears on the 14th line at position 34. I could do this manually if it were one file, but I have to do it for 300 files.
sed is your friend:
sed -i.bak '14s/,[[:blank:]]*$//' file ...
This is a bit fragile: it assumes the comma to remove is always on line 14, which is not necessarily the line before the closing brace.
Depending on the platform, sed or awk may give varying results; perl might be more flexible:
perl -i.bak -00pe 's/,(?!.*,)//s' file
# , matches a comma.
# (?!.*,) negative lookahead asserts no comma after matched comma.
# s is a DOTALL modifier matching newline characters also.
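The same negative-lookahead pattern behaves identically in Python's re module, which makes it easy to try on a small string. This is only a sketch on made-up sample text, with re.DOTALL playing the role of the s modifier.

```python
import re

text = '{\n"a" : 1,\n"b" : 2,\n}\n'

# A comma that is not followed, anywhere later in the text, by another comma.
result = re.sub(r',(?!.*,)', '', text, flags=re.DOTALL)
print(result, end='')
```

Only the final comma satisfies the lookahead, so exactly one comma is removed.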
This is a straightforward ed one-liner:
ed foo.json <<EOF
?,?s/,\([^,]*\)$/\1/
wq
EOF
That line can be broken into an address and a command.
The address is ?,?, namely the previous line matching the regular expression ,.
The command is s/re/replacement/, where the regular expression is ,\([^,]*\)$ (a literal ,, a captured group of zero or more characters that are not ,, and the end of the line), and the replacement is \1 (the first captured group).
Technically it's a two-line ed script: the second line, wq, saves and quits.
You could invoke this in a loop with find, for instance:
find . -name '*.json' | while IFS= read -r name ; do
ed -s "$name" <<EOF
H
[…ed commands…]
wq
EOF
done
I've also added ed -s to suppress the file size message, and H to output verbose errors instead of the infamous ?.
Thanks for the answers. I was easily able to solve the question myself using Python:
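with open(fjson, 'r') as f:  # fjson holds the path of one JSON file
    data = f.readlines()
ndx = len(data)
# The last comma sits on the third line from the end
data[ndx-3] = data[ndx-3].replace(',', '')
with open(fjson, 'w') as f:  # write the result back (not shown in the original post)
    f.writelines(data)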
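For all 300 files, a less position-dependent variant might look like the sketch below. It is an assumption-laden illustration: strip_trailing_commas and fix_files are hypothetical helpers, the *.json pattern is a guess at the file naming, and the regular expression would also touch a ",}" sequence inside a string value.

```python
import json
import re
from pathlib import Path

# A comma followed only by whitespace and then a closing brace or bracket.
TRAILING_COMMA = re.compile(r',(\s*[}\]])')

def strip_trailing_commas(text: str) -> str:
    """Drop commas that directly precede a closing '}' or ']'."""
    return TRAILING_COMMA.sub(r'\1', text)

def fix_files(directory: str, pattern: str = '*.json') -> None:
    """Rewrite every matching file in place (hypothetical layout)."""
    for path in Path(directory).glob(pattern):
        fixed = strip_trailing_commas(path.read_text())
        json.loads(fixed)  # raises ValueError if the result is still invalid JSON
        path.write_text(fixed)

# Shortened version of the file from the question.
sample = '[\n{\n"ncyc" : 28817,\n"dtmax" : 0.200000E-009,\n}\n]\n'
fixed_sample = strip_trailing_commas(sample)
print(json.loads(fixed_sample))
```

Calling fix_files('.') would process the current directory; keeping a backup first is advisable, since the files are overwritten.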

How to disallow nested comments in antlr

I currently have a multiline comment lexer rule in antlr which looks like:
MULTILINE: '/*' .* '*/' {$channel=HIDDEN;} ;
However, this currently allows things like:
/* /* hello */ */
Is there any possible way to disable nesting comments in antlr? I've tried various things like
MULTILINE: '/*' (~(MULTILINE)|.*) '*/' {$channel=HIDDEN;} ;
But that doesn't work. Any help would be much appreciated!
No, that is not correct: in ANTLR 3, .* and .+ are not greedy by default.
Given the parser generated by the following grammar:
grammar T;
parse
: (t=. {System.out.printf("\%-15s'\%s'\n", tokenNames[$t.type], $t.text);} )* EOF
;
MULTILINE
: '/*' .* '*/' {$channel=HIDDEN;}
;
OTHER
: .
;
the input "/* /* hello */ */" would produce the following on your command line:
OTHER ' '
OTHER '*'
OTHER '/'
I.e., "/* /* hello */" is being put on the HIDDEN channel, and 3 OTHER tokens are constructed.
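The same outcome can be reproduced with a non-greedy regular expression. This sketch uses Python's re module as a stand-in for the lexer rule; it is an analogy, not ANTLR's matching algorithm.

```python
import re

source = "/* /* hello */ */"

# Non-greedy body: stop at the first '*/' after the opening '/*'.
match = re.match(r'/\*.*?\*/', source, flags=re.DOTALL)
comment = match.group()
rest = source[match.end():]
print(repr(comment), list(rest))
```

The three leftover characters correspond to the three OTHER tokens shown above.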
Try this:
With this rule, neither the prefix nor the suffix can be recognized inside the comment body, so nesting is not allowed.
COMMENT_NON_NEST
: '/*'
( ('/'|'*'+)? ~[*/] )*?
('/'|'*'+?)?
'*/'
{$channel=HIDDEN;}
;

ANTLR comment problem

I am trying to write a comment matching rule in ANTLR, which is currently the following:
LINE_COMMENT
: '--' (options{greedy=false;}: .)* NEWLINE {Skip();}
;
NEWLINE : '\r'|'\n'|'\r\n' {Skip();};
This code works fine except when a comment is the last thing in a file, in which case it throws a NoViableAlt exception. How can I fix this?
Why not:
LINE_COMMENT : '--' (~ NEWLINE)* ;
fragment NEWLINE : '\r' '\n'? | '\n' ;
If you haven't come across this yet, lexical rules (all uppercase) can only consist of constants and tokens, not other lexemes. You need a parser rule for that.
I'd go for:
LINE_COMMENT
: '--' ~( '\r' | '\n' )* {Skip();}
;
NEWLINE
: ( '\r'? '\n' | '\r' ) {Skip();}
;
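The idea behind this rule, consuming everything up to but not including the line break, can be tried out with an equivalent regular expression. This is a Python sketch of the same pattern, not ANTLR itself, and the sample input is made up.

```python
import re

# '--' followed by any characters that are not a line break.
LINE_COMMENT = re.compile(r'--[^\r\n]*')

source = "a = 1 -- first\nb = 2 -- last, no trailing newline"
comments = LINE_COMMENT.findall(source)
print(comments)
```

Because the pattern never requires a newline, a comment at the very end of the input matches just like any other.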

Resources