Lightyear parser behaves in unexpected way - whitespace

I am trying to build a formatter for Idris with Lightyear.
The whole program so far is here:
https://github.com/hejfelix/IdrisFMT/blob/501a4a9e8b1b4154ed0d7836676c24d98de8b76a/IdrisFmt.idr
For now, the purpose is to tokenize the file itself and then pretty print it, i.e., the file as input should be a fixpoint.
The problem comes after each string literal, where my parser seems to eat up whitespace. If I put anything other than whitespace immediately after a string literal, the parser returns both that character and all the following whitespace as tokens.
This sample program will show the error:
main2 : IO ()
main2 = putStrLn $ str
  where
    str = case parse tokenParser "\"IdrisFMT.idr\" \n" of
               (Left l) => "failed" ++ show l
               (Right r) => show $ map (show #{default}) r
This prints out:
*IdrisFMT> :exec main2
["StringLiteral(\"IdrisFMT.idr\")"]
If I change the string I'm parsing to "\"IdrisFMT.idr\"c \n", I get:
*IdrisFMT> :exec main2
["StringLiteral(\"IdrisFMT.idr\")", "Identifier(c)", "' '", "'\\n'"]
which is what I expected.
I believe the error arises from the way I parse the string literals, but I am failing to understand my mistake, and I'm having trouble finding a good way to debug Lightyear parsers.
The implementation of my string literal parser is as follows:
escape : Parser String
escape = do
  d <- char '\\'
  c <- oneOf "\\\"0nrvtbf"
  pure $ pack $ (the $ List Char) [d,c]

nonEscape : Parser String
nonEscape = map (\x => pack $ (the $ List _) [x]) $ noneOf "\\\"\0\n\r\v\t\b\f"

character : Parser String
character = nonEscape <|>| escape

stringLiteralToken : Parser Token
stringLiteralToken = map (StringLiteral . concat) $ dquote (many character)
How can I prevent my string literal parser from eating up whitespace after the literal?

After chatting on the #idris channel, I learned that most of the built-in higher-order parsers (e.g. dquote) skip whitespace at the end.
In my case, this was not what I wanted. Instead, I used the between function, which takes three parsers: one for the opening delimiter, one for the closing delimiter, and one for whatever comes in between.
To parse string literals, I am now doing this:
escape : Parser String
escape = do
  d <- char '\\'
  c <- oneOf "\\\"0nrvtbf'"
  pure $ pack $ (the $ List Char) [d,c]

nonEscape : Parser String
nonEscape = map (\x => pack $ (the $ List _) [x]) $ noneOf "\\\"\0\n\r\v\t\b\f"

character : Parser String
character = nonEscape <|>| escape

stringLiteralToken : Parser Token
stringLiteralToken = map (StringLiteral . concat) $ (between (char '"') (char '"')) (many character)
This solved my problem.

Related

Match any character, including special characters using the Match function

I have some relatively weird file names,
and the glob function doesn't seem to pick them up with the usual wildcards:
fmt.Println(filepath.Match("/home/catch/*.xml", "/home/catch/{foo/x/y}.xml"))
So, I want to match XML files in the catch folder, and they might have special characters in their names, like {path1/path2}.xml.
Sadly, the * wildcard won't match that, I assume because slashes (and maybe curly braces) are treated as separator characters?
I'm using Linux, and with this pattern Match returns true. So although names like {path1/path2}.xml are not common, it is possible to match them.
package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    // "\\/" in the Go string literal is the pattern \/, i.e. an escaped, literal slash.
    fmt.Println(filepath.Match("/home/catch/*\\/*", "/home/catch/{path1/path2}.xml"))
}
Output
true <nil>
But on Windows, escaping is disabled. Instead, \\ is treated as a path separator, so this pattern will not work there:
"/home/catch/*\\/*"
You can read it in the documentation:
pattern:
    { term }
term:
    '*'         matches any sequence of non-Separator characters
    '?'         matches any single non-Separator character
    '[' [ '^' ] { character-range } ']'
                character class (must be non-empty)
    c           matches character c (c != '*', '?', '\\', '[')
    '\\' c      matches character c
character-range:
    c           matches character c (c != '\\', '-', ']')
    '\\' c      matches character c
    lo '-' hi   matches character c for lo <= c <= hi
Match requires pattern to match all of name, not just a substring. The only possible returned error is ErrBadPattern, when pattern is malformed.
On Windows, escaping is disabled. Instead, '\\' is treated as path separator.
https://pkg.go.dev/path/filepath#Match

The letter disappeared after splitting a string in my Ruby program

I am a newbie in Ruby. In my Ruby program, there is a piece of code for parsing a geocode. The code looks like this:
string = "GPS:3;S23.164865;E113.428970;88"
info = string.tr("GPS:",'')
info_array = info.split(";")
puts "GPS: #{info_array[0]},#{info_array[1]},#{info_array[2]}"
The code should split the string on ";" and print the first three pieces (3, S23.164865 and E113.428970), so the expected output is
GPS: 3,S23.164865,E113.428970
but the result is:
GPS: 3,23.164865,E113.428970
Yes, the 'S' letter disappeared...
If I use
string = "GPS:3;N23.164865;E113.428970;88"
info = string.tr("GPS:",'')
info_array = info.split(";")
puts "GPS: #{info_array[0]},#{info_array[1]},#{info_array[2]}"
it prints the expected result:
GPS: 3,N23.164865,E113.428970
I am very confused why this happens. Can you help?
It looks like you were expecting String#tr to behave like String#gsub.
Calling string.tr("GPS:", '') does not replace the complete string "GPS:" with the empty string. Instead, it treats "GPS:" as a set of characters and replaces each of them individually. Commonly you will find .tr() called with an equal number of input and replacement characters, and in that case each input character is replaced by the output character in the corresponding position. But called the way you have it, with only the empty string '' as the replacement argument, it deletes every G, P, S and : anywhere in the string. That is also why your second example works: N is not one of those characters.
>> "String with S and G and a: P".tr("GPS:", '')
=> "tring with  and  and a "
Instead, use .gsub('GPS:', '') to replace the complete match as a group.
string = "GPS:3;S23.164865;E113.428970;88"
info = string.gsub('GPS:', '')
info_array = info.split(";")
puts "GPS: #{info_array[0]},#{info_array[1]},#{info_array[2]}"
# prints
GPS: 3,S23.164865,E113.428970
Here we've called .gsub() with a string argument. It is probably more often called with a regexp as the search argument, though.
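To make the difference concrete, here is a small sketch contrasting the two modes of String#tr with a prefix-removal approach. The helper name parse_gps is made up for illustration, and String#delete_prefix assumes Ruby 2.5 or newer; on older versions, sub(/\AGPS:/, '') should behave the same way for this input.

```ruby
# Hypothetical helper: drop a leading "GPS:" tag, then split the remaining fields.
# Only the prefix is removed, so an "S" later in the string is left alone.
def parse_gps(line)
  line.delete_prefix("GPS:").split(";")
end

# With equal-length arguments, tr maps characters by position: e -> i, l -> p.
puts "hello".tr("el", "ip")

# With an empty replacement, tr deletes every listed character wherever it occurs,
# which is how the "S" of "S23.164865" gets lost.
puts "GPS:3;S23.164865".tr("GPS:", "")

fields = parse_gps("GPS:3;S23.164865;E113.428970;88")
puts "GPS: #{fields.first(3).join(',')}"
```

Unlike gsub('GPS:', ''), delete_prefix only touches the start of the string, which may or may not be what you want if the tag can appear elsewhere.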

ANTLR grammar: Allow whitespace matching only in template string

I want to parse template strings:
`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`
Here is my grammar:
varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)* ')' ;
WS : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;
When the input is parsed, the template string no longer contains any whitespace because of the WS -> skip rule. When I put the TemplateStringLiteral rule before WS, I get the error:
extraneous input ' ' expecting {'`'}
How can I allow whitespaces to be parsed and not skipped only inside the template string?
What is currently happening
When testing your example against your current grammar displaying the generated tokens, the lexer gives this:
[#0,0:0='`',<'`'>,1:0]
[#1,1:4='Some',<VAR>,1:1]
[#2,6:9='text',<VAR>,1:6]
[#3,11:12='${',<'${'>,1:11]
[#4,13:20='variable',<VAR>,1:13]
[#5,21:21='.',<'.'>,1:21]
[#6,22:25='name',<VAR>,1:22]
[#7,26:26='}',<'}'>,1:26]
... shortened ...
[#26,85:84='<EOF>',<EOF>,2:0]
This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?
As mentioned in this answer, ANTLR uses the longest possible match to create a token. Since your TemplateStringLiteral rule only matches a single character, but your VAR rule matches arbitrarily many, the lexer uses the latter to match Some.
What you could try (Spoiler: won't work)
You could try to modify the rule like this:
TemplateStringLiteral: ('\\`' | ~'`')+ ;
so that it captures more than one character and is therefore preferred. There are two reasons why this does not work:
How would the lexer match anything to the VAR rule, ever?
The TemplateStringLiteral rule now also matches ${ therefore prohibiting the correct recognition of the start of a template chunk.
How to achieve what you actually want
There might be another solution, but this one works:
File MartinCup.g4:
parser grammar MartinCup;

options { tokenVocab=MartinCupLexer; }

templateString
    : BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
    ;

template
    : TemplateStart variable TemplateEnd
    ;

variable
    : varname funParameter? (Dot variable)*
    ;

varname
    : VAR
    ;

funParameter
    : OpenPar variable? (Comma variable)* ClosedPar
    ;
File MartinCupLexer.g4:
lexer grammar MartinCupLexer;

BackTick : '`' ;

TemplateStart
    : '${' -> pushMode(templateMode)
    ;

TemplateStringLiteral
    : '\\`'
    | ~'`'
    ;

mode templateMode;

VAR
    : [$]?[a-zA-Z0-9_]+
    | [$]
    ;

OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;

TemplateEnd
    : '}' -> popMode;
This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. It thereby does not catch non-template text like Some.
Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.
About the whitespaces
I assume you want to keep whitespace inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:
line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}
The reason for this is the same as above, Some is lexed to VAR.

How to make an ANTLR3 VBScript parser see '&htmlTag' as two tokens?

While parsing VBScript code with my ANTLR3 parser, I found it processes everything except
x = y &htmlTag
This code is obviously meant as "x = y & htmlTag". (Me, I put spaces around operators in any language, but the code I am parsing is not mine.) The lexer should find the longest string that is a valid token, right? So that should work fine here: As '&h' is not followed by text that results in a hex literal, the lexer should decide that this is not a hex literal and the longest valid token is the operator '&'. Followed by an identifier.
But if my grammar says:
HexOrOctalLiteral :
( ( AMPERSAND H HexDigit ) => AMPERSAND H HexDigit+
| ( AMPERSAND O OctalDigit ) => AMPERSAND O OctalDigit+
)
AMPERSAND?
;
ConcatenationOperator: AMPERSAND;
fragment AMPERSAND : '&';
fragment HexDigit : Digit | A | B | C | D | E | F;
fragment OctalDigit : '0' .. '7';
fragment H : 'h' | 'H';
My parser complains: required (...)+ loop did not match anything at character 'h' when processing the '&htmlTag'. It appears the lexer has already decided that it has found a HexOrOctalLiteral and will no longer consider a concat operator. My grammar has k=1, not sure if that is relevant here because setting it higher for this rule using 'options' seems to make no difference.
What am I missing?

What is the opposite of Regexp.escape?

What is the opposite of Regexp.escape?
> Regexp.escape('A & B')
=> "A\\ &\\ B"
> # do something to get the following result (something like Regexp.unescape("A\\ &\\ B"))
=> "A & B"
How can I get the original value?
replaces = Hash.new { |hash,key| key } # simple trick to return key if there is no value in hash
replaces['t'] = "\t"
replaces['n'] = "\n"
replaces['r'] = "\r"
replaces['f'] = "\f"
replaces['v'] = "\v"
rx = Regexp.escape('A & B')
str = rx.gsub(/\\(.)/){ replaces[$1] }
Also make sure to #puts the output in irb, because #inspect escapes characters by default.
Basically, escaping/quoting looks for metacharacters and prepends a \ character (which itself has to be escaped in string literals in source code). But if it finds a control character from the list \t, \n, \r, \f, \v, quoting outputs a \ character followed by the ASCII letter for that control character (t, n, r, f or v).
UPDATE:
My solution had problems with special characters (\n, \t and so on). I updated it after investigating the source code of the rb_reg_quote method.
UPDATE 2:
replaces is a hash that converts escaped characters to unescaped ones (that's why it is used in the block attached to gsub). It is indexed by the character that follows the backslash (the second character of the sequence) and returns the unescaped value. The only defined values are control characters, but there is also a default_proc attached (the block passed to Hash.new), which returns the key itself if no value is found in the hash. So it works like this:
for "n" it returns "\n", and likewise for all other escaped control characters, because that is the value associated with the key
for "(" it returns "(", because there is no value associated with the "(" key, so the hash calls #default_proc, which returns the key itself
The only characters escaped by Regexp.escape are metacharacters and control characters, so we don't have to worry about alphanumerics.
Take a look at http://ruby-doc.org/core-2.0.0/Hash.html#method-i-default_proc for documentation on #default_proc.
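The same idea can be wrapped in a method and checked with a round trip. This is only a sketch: regexp_unescape and CONTROL_ESCAPES are names invented here (Ruby itself has no Regexp.unescape), and it assumes the input really came from Regexp.escape.

```ruby
# Hypothetical lookup table: letter that follows the backslash => control character.
CONTROL_ESCAPES = {
  "t" => "\t", "n" => "\n", "r" => "\r", "f" => "\f", "v" => "\v"
}.freeze

# Hypothetical helper, not part of Ruby's standard library.
def regexp_unescape(escaped)
  # Each match consumes a backslash plus exactly one character, so an escaped
  # backslash is taken as a pair and the text after it is left untouched.
  # The /m flag lets "." also match a newline.
  escaped.gsub(/\\(.)/m) { CONTROL_ESCAPES.fetch($1, $1) }
end

samples = ["A & B", "a.b*c?", "tab\there", "line\nbreak", "back\\slash", "C:\\new"]
samples.each do |original|
  restored = regexp_unescape(Regexp.escape(original))
  puts "#{original.inspect} round-trips: #{restored == original}"
end
```

Using fetch with a default instead of a Hash default_proc keeps the table frozen and free of side effects; the behaviour for unknown keys is the same as described above.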
You can perhaps use something like this?
def unescape(s)
  eval %Q{"#{s}"}
end
puts unescape('A\\ &\\ B')
Credits to this question.
If you are okay with a regex solution, you can use this:
res = s.gsub(/\\(?!\\)|(\\)\\/, "\\1")
Try this
>> r = Regexp.escape("A & B (and * c [ e] + )")
# => "A\\ &\\ B\\ \\(and\\ \\*\\ c\\ \\[\\ e\\]\\ \\+\\ \\)"
>> r.gsub("\\(","(").gsub("\\)",")").gsub("\\[","[").gsub("\\]","]").gsub("\\{","{").gsub("\\}","}").gsub("\\.",".").gsub("\\?","?").gsub("\\+","+").gsub("\\*","*").gsub("\\ "," ")
# => "A & B (and * c [ e] + )"
Basically, these (, ), [, ], {, }, ., ?, +, * are the meta characters in regex. And also \ which is used as an escape character.
The chain of gsub() calls replaces each escaped pattern with the corresponding literal character.
I am sure there is a way to DRY this up.
Update: DRY version as suggested by user2503775
>> r.gsub("\\","")
Update:
following are the special characters in regex
[,],{,},(,),|,-,*,.,\\,?,+,^,$,<space>,#,\t,\f,\v,\n,\r
using a regex replace using \\(?=([\\\*\+\?\|\{\[\(\)\^\$\.\#\ ]))\
should give you the unescaped string; you would only have to replace the \r\n sequences with their CRLF counterparts.
"There\ is\ a\ \?\ after\ the\ \(white\)\ car\.\ \r\n\ it\ should\ be\ http://car\.com\?\r\n"
is unescaped to :
"There is a ? after the (white) car. \r\n it should be http://car.com?\r\n"
and removing the \r\n gives you :
There is a ? after the (white) car.
it should be http://car.com?
