What is the number of tokens in the following code snippet?

What is the number of tokens in the following code snippet? - compilation

printf ( "Some string here" , ++ i ++ && & i * * * a ) ;
I am confused, how to count the number of tokens for this code snippet. Basically, I am not getting how &&& and *** will be counted.
I think && is one token and & is one, while, *** are total 3 tokens, but i am not sure that it is right or not.
I have edited the code with whitespaces to seperate tokens .
Can someone explain with any technique, so that i can apply for any code snippet?
Any help will be greatly appreciated !!

C tokenization is "greedy" - it attempts to build the longest legal token first. See the online C 2011 draft standard, section 6.4.6 (Punctuators) for the list of legal punctuator tokens (&&, ++, etc.).
The sequence ++i++&&&i***a will be tokenized as ++, i, ++, &&, &, i, *, *, *, a. It will be parsed as (++(i++)) && ((&i) * (**a)), which is not a legal expression (the result of i++ is not an lvalue, so it cannot be the operand of the unary ++ operator).

Your analysis is correct. Tokenization in C is greedy, meaning that when &&& is encountered, the longest possible token && is scanned first. There is no ** token, so each * character is its own token.
The tokens are:
printf
(
"Some string here"
,
++
i
++
&&
&
i
*
*
*
a
)
;

Related

What do these symbols mean in the RFC docs regarding grammars?

Here are the examples:
Transfer-Encoding = "Transfer-Encoding" ":" 1#transfer-coding
Upgrade = "Upgrade" ":" 1#product
Server = "Server" ":" 1*( product | comment )
delta-seconds = 1*DIGIT
Via = "Via" ":" 1#( received-protocol received-by [ comment ] )
chunk-extension= *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
date3 = month SP ( 2DIGIT | ( SP 1DIGIT ))
Questions are:
What is the 1#transfer-coding (the 1# regarding the rule transfer-coding)? Same with 1#product.
What does 1 times x mean, as in 1*( product | comment )? Or 1*DIGIT.
What do the brackets mean, as in [ comment ]? The parens (...) group it all, but what about the [...]?
What does the *(...) mean, as in *( ";" chunk-ext-name [ "=" chunk-ext-val ] )?
What do the nested square brackets mean, as in [ abs_path [ "?" query ]]? Nested optional values? It doesn't make sense.
What does 2DIGIT and 1DIGIT mean, where do those come from / get defined?
I may have missed where these are defined, but knowing these would help clarify how to parse the grammar definitions they use in the RFCs.
I get the rest of the grammar notation, juts not these few remaining pieces.
Update: Looks like this is a good start.
Square brackets enclose an optional element sequence:
[foo bar]
is equivalent to
*1(foo bar).
Specific Repetition: nRule
A rule of the form:
<n>element
is equivalent to
<n>*<n>element
That is, exactly <n> occurrences of <element>. Thus, 2DIGIT is a
2-digit number, and 3ALPHA is a string of three alphabetic
characters.
Variable Repetition: *Rule
The operator "*" preceding an element indicates repetition. The full
form is:
<a>*<b>element
where <a> and <b> are optional decimal values, indicating at least
<a> and at most <b> occurrences of the element.
Default values are 0 and infinity so that *<element> allows any
number, including zero; 1*<element> requires at least one;
3*3<element> allows exactly 3; and 1*2<element> allows one or two.
But what I'm still missing is what the # means?
Update 2: Found it I think!
#RULE: LISTS
A construct "#" is defined, similar to "*", as follows:
<l>#<m>element
indicating at least <l> and at most <m> elements, each separated
by one or more commas (","). This makes the usual form of lists
very easy; a rule such as '(element *("," element))' can be shown
as "1#element".
Also, what do these mean?
1*2DIGIT
2*4DIGIT

antlr v3 lexer predicts wrong

consider the following (combined) grammar
grammar CastModifier;
tokens{
E='=';
C='=()';
Lp='(';
Rp=')';
I = 'int';
S=';';
}
compilationUnit
: assign+ EOF
;
assign
: '=' Int ';'
| '=' '(' 'int' ')' Int ';'
| '=()' Int ';'
;
Int
: ('1'..'9') ('0'..'9')*
| '0'
;
Whitespace
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
Unfortunately the lexer does not always predict the next token correctly. For instance for the following code
=(int) 1;
The Lexer predicts it must be the '=()' token. It detects the correct token for the following code
= (int) 1;
I figured this problem should be solvable by ANTLR if I provide the following option for the rule "assign":
options{k=3;}
But apparently it does not help and neither if I define this option for the whole grammar. How can I resolve this problem? My workaround at the moment is to built the '=()' token out of '='('')' but that allows the user to write
= ()
Which, well, is kind of ok but I am just wondering why ANTLR is not able to predict it correctly.

The options{k=3;} causes the parser to look ahead at most 3 tokens, not the lexer. ANTLR3's lexer is not that smart: once it matches a character, it won't give up on that match. So, in your case, from the input =(int) 1;, =( is matched but then there is an i in the char stream and no token that matches this.
My workaround at the moment is to built the '=()' token out of '='('')'
I'd call that a proper solution: '=' '(' and ')' are separate tokens and should be handled as such (to be glued together in the parser, not the lexer).

Ruby Parenthesis syntax exception with i++ ++i

Why does this throw a syntax error? I would expect it to be the other way around...
>> foo = 5
>> foo = foo++ + ++foo
=> 10 // also I would expect 12...
>> foo = (foo++) + (++foo)
SyntaxError: <main>:74: syntax error, unexpected ')'
foo = (foo++) + (++foo)
^
<main>:75: syntax error, unexpected keyword_end, expecting ')'
Tried it with tryruby.org which uses Ruby 1.9.2.
In C# (.NET 3.5) this works fine and it yields another result:
var num = 5;
var foo = num;
foo = (foo++) + (++foo);
System.Diagnostics.Debug.WriteLine(foo); // 12
I guess this is a question of operator priority? Can anybody explain?
For completeness...
C returns 10
Java returns 12

There's no ++ operator in Ruby. Ruby is taking your foo++ + ++foo and taking the first of those plus signs as a binary addition operator, and the rest as unary positive operators on the second foo.
So you are asking Ruby to add 5 and (plus plus plus plus) 5, which is 5, hence the result of 10.
When you add the parentheses, Ruby is looking for a second operand (for the binary addition) before the first closing parenthesis, and complaining because it doesn't find one.
Where did you get the idea that Ruby supported a C-style ++ operator to begin with? Throw that book away.

Ruby does not support this syntax. Use i+=1 instead.
As #Dylan mentioned, Ruby is reading your code as foo + (+(+(+(+foo)))). Basically it's reading all the + signs (after the first one) as marking the integer positive.

Ruby does not have a ++ operator. In your example it just adds the second foo, "consuming" one plus, and treats the other ones as unary + operators.

Remove whitespace, but do so last?

I am attempting to parse Lua, which depends on whitespace in some cases due to the fact that it doesn't use braces for scope. I figure that by throwing out whitespace only if another rule doesn't match is the best way, but i have no clue how to do that. Can someone help me?

Looking at Lua's documentation, I see no need to take spaces into account.
The parser rule ifStatement:
ifStatement
: 'if' exp 'then' block ('elseif' exp 'then' block 'else' block)? 'end'
;
exp
: /* todo */
;
block
: /* todo */
;
should match both:
if j==10 then print ("j equals 10") end
and:
if j<10 then
print ("j < 10")
elseif j>100 then
print ("j > 100")
else
print ("j >= 10 && j <= 100")
end
No need to take spaces into account, AFAIK. So you can just add:
Space
: (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
;
in your grammar.
EDIT
It seems there is a Lua grammar on the ANTLR wiki: http://www.antlr.org/grammar/1178608849736/Lua.g
And it seems I my suggestion for an if statement slightly differs from the grammar above:
'if' exp 'then' block ('elseif' exp 'then' block)* ('else' block)? 'end'
which is the correct one, as you can see.

ANTLR parse problem

I need to be able to match a certain string ('[' then any number of equals signs or none then '['), then i need to match a matching close bracket (']' then the same number of equals signs then ']') after some other match rules. ((options{greedy=false;}:.)* if you must know). I have no clue how to do this in ANTLR, how can i do it?
An example: I need to match [===[whatever arbitrary text ]===] but not [===[whatever arbitrary text ]==].
I need to do it for an arbitrary number of equals signs as well, so therein lies the problem: how do i get it to match an equal number of equals signs in the open as in the close? The supplied parser rules so far dont seem to make sense as far as helping.

You can't easely write a lexer for it, you need parsing rules. Two rules should be sufficient. One is responsible for matching the braces, one for matching the equal signs.
Something like this:
braces : '[' ']'
| '[' equals ']'
;
equals : '=' equals '='
| '=' braces '='
;
This should cover the use case you described. Not absolute shure but maybe you have to use a predicate in the first rule of 'equals' to avoid ambiguous interpretations.
Edit:
It is hard to integrate your greedy rule and at the same time avoid a lexer context switch or something similar (hard in ANTLR). But if you are willing to integrate a little bit of java in your grammer you can write an lexer rule.
The following example grammar shows how:
grammar TestLexer;
SPECIAL : '[' { int counter = 0; } ('=' { counter++; } )+ '[' (options{greedy=false;}:.)* ']' ('=' { counter--; } )+ { if(counter != 0) throw new RecognitionException(input); } ']';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
rule : ID
| SPECIAL
;

Your tags mention lexing, but your question itself doesn't. What you're trying to do is non-regular, so I don't think it can be done as part of lexing (though I don't remember if ANTLR's lexer is strictly regular -- it's been a couple of years since I last used ANTLR).
What you describe should be possible in parsing, however. Here's the grammar for what you described:
thingy : LBRACKET middle RBRACKET;
middle : EQUAL middle EQUAL
| LBRACKET RBRACKET;

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

What is the number of tokens in the following code snippet? - compilation

Your analysis is correct. Tokenization in C is greedy, meaning that when &&& is encountered, the longest possible token && is scanned first. There is no ** token, so each * character is its own token. The tokens are: printf ( "Some string here" , ++ i ++ && & i * * * a ) ;

Related

What do these symbols mean in the RFC docs regarding grammars?

antlr v3 lexer predicts wrong

Ruby Parenthesis syntax exception with i++ ++i

Remove whitespace, but do so last?

ANTLR parse problem

Categories

Resources