Why is this Perl 6 feed operator a "bogus statement"? - whitespace

I took this example from Day 10 – Feed operators of the Perl 6 2010 Advent Calendar with the slight change of .uc for the .ucfirst that's no longer there:
my @rakudo-people = <scott patrick carl moritz jonathan jerry stephen>;
@rakudo-people
==> grep { /at/ } ==> map { .uc } ==> my @who-it's-at;
say ~@who-it's-at;
I write it slightly differently with some additional whitespace:
my @rakudo-people = <scott patrick carl moritz jonathan jerry stephen>;
@rakudo-people
==> grep { /at/ }
==> map { .uc } ==> my @who-it's-at;
say ~@who-it's-at;
Now it's a "bogus statement":
===SORRY!=== Error while compiling ...
Bogus statement
------> ==> grep { /at/ }⏏<EOL>
expecting any of:
postfix
prefix
statement end
term
This isn't a problem with just this example. Some examples in the current docs can exhibit the same behavior.
If I add an unspace to the end of the offending line, it works again:
my @rakudo-people = <scott patrick carl moritz jonathan jerry stephen>;
@rakudo-people
==> grep { /at/ } \
==> map { .uc } ==> my @who-it's-at;
say ~@who-it's-at;
Curiously, a comment at the end of that line does not work. I would have thought it would have eaten the offending whitespace.
The feed operator says:
In the case of routines/methods that take a single argument or where the first argument is a block, it's often required that you call with parentheses
That works:
my @rakudo-people = <scott patrick carl moritz jonathan jerry stephen>;
@rakudo-people
==> grep( { /at/ } )
==> map { .uc } ==> my @who-it's-at;
say ~@who-it's-at;
But why wasn't that a problem in the first form? What is the whitespace doing here? And what situations are included in "often required"?

From the Perl 6 documentation on Separating Statements:
A closing curly brace followed by a newline character implies a statement separator
In other words, whenever the closing brace of a block is the last thing in a line (not counting whitespace and comments), then the parser implicitly inserts a semicolon after it.
This allows us to write things like the following without a semicolon after the }:
my @foo = @bar.map: {
...
}
The semicolon would be ugly there, and if you had first written the loop as for @bar { ... } and then decided to turn it into a map with assignment like this, adding the trailing semicolon would be annoying and easy to forget. So for the most part, this automatic statement termination after end-of-line blocks is helpful.
However, feed operators feeding into blocks are one case (possibly the only one) where it trips people up.
To prevent it from happening, insert \ (a.k.a. unspace) after the block, as you've already noted. The unspace makes the whitespace which includes the newline invisible to the parser, and thus the aforementioned newline-based parser rule won't be applied.

Rascal: resolve ambiguity with comment

Consider the following grammar:
module Tst
lexical Id = [a-z][a-z0-9]* !>> [a-z0-9];
layout Layout = WhitespaceAndComment* !>> [\ \t\n\r];
lexical WhitespaceAndComment
= [\ \t\n\r]
| @category="Comment" ^ "*" ![\n]* $
;
start syntax TstStart = Id*;
then
start[TstStart] t = parse(#start[TstStart], "*bla\nABC");
gives an ambiguity, probably because the comment can be placed before or after the empty list of strings.
So, I have 2 questions:
How can I use diagnose() to get a diagnosis? I have tried diagnose(t) and diagnose(parse(#start[TstStart], "*bla\nABC")), without success.
What is the ambiguity and how can I resolve it?
Sorry, it has been a while. The comment definition contains a flaw; it has to be corrected as follows:
@category="Comment" ^ "*" ![\n]* $
This resolves the ambiguity, but I still would like to know how to use diagnose().
The ambiguity is caused by ![*]* which can eat a \n or not, because the whitespace notion can also eat the \n or not.
The diagnose function is notoriously bad at spotting issues with whitespace. That can be improved.
The solution is to use ![*\n]* and end the comment line with "\n"
Another ambiguity will happen with actual * comments in the code: between String* and Id* the comments could go either before or after the empty lists. To fix this, add a follow restriction with [*]: layout Layout = WhitespaceAndComment* !>> [\ \t\n\r] !>> [*];
When I make the example slightly more complicated I get another ambiguity.
module Tst
lexical Id = [a-z][a-z0-9]* !>> [a-z0-9];
lexical String = "\"" ![\"]* "\"";
layout Layout = WhitespaceAndComment* !>> [\ \t\n\r];
lexical WhitespaceAndComment
= [\ \t\n\r]
| @category="Comment" ^ "*" ![\n]* $
;
start syntax TstStart = String* Id*;
Then "*bla\nABC" gives an ambiguity again. Probably because the comment can be placed before, within and after the empty list of strings. How to resolve it?

Starting a new cycle if condition is met in sed

I am performing several commands (GNU sed) on a line, and if a certain condition is met, I want to skip the rest of the commands.
Example:
I want to substitute all d with 4.
If the line starts with A, C or E, skip the rest of the commands (other substitutions etc.).
I want to use basic regular expressions only. If I could use extended regex, this would be trivial:
sed -r 's/d/4/g; /^(A|C|E)/! { s/a/1/g; s/b/2/g; s/c/3/g }' data
Now, with BRE, this will work fine but for more conditions, it will be really ugly:
sed 's/d/4/g; /^A/! { /^C/! { /^E/! { s/a/1/g; s/b/2/g; s/c/3/g } } }' data
Example input:
Aaabbccdd
Baabbccdd
Caabbccdd
Daabbccdd
Eaabbccdd
Example output:
Aaabbcc44
B11223344
Caabbcc44
D11223344
Eaabbcc44
This is just an example. I am not looking for different ways to approach the problem. I want to know some better ways to start a new cycle.
I suggest using b:
/^\(A\|C\|E\)/b
From man sed:
b label: Branch to label; if label is omitted, branch to end of script.
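Putting it together with the question's example data (GNU sed is assumed, since \| alternation in a BRE is a GNU extension):

```shell
# After s/d/4/g runs, any line starting with A, C or E branches to the
# end of the script, skipping the remaining substitutions entirely.
printf 'Aaabbccdd\nBaabbccdd\nCaabbccdd\n' |
  sed 's/d/4/g; /^\(A\|C\|E\)/b; s/a/1/g; s/b/2/g; s/c/3/g'
# Aaabbcc44
# B11223344
# Caabbcc44
```

With more prefixes to skip, the address just grows into a single alternation instead of nested ! blocks.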

sed to get string between two patterns

I am working on a latex file from which I need to pick out the references marked by \citep{}. This is what I am doing using sed.
cat file.tex | grep citep | sed 's/.*citep{\(.*\)}.*/\1/g'
Now this one works if there is only one pattern in a line. If there is more than one pattern, i.e. \citep, in a line, it fails. It also fails when there is only one pattern but more than one closing bracket }. What should I do so that it works for all the patterns in a line and also matches only up to the closing bracket I am looking for?
I am working on bash. And a part of the file looks like this:
of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}.
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography,
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of
On one line, I get an answer like this:
BilhamE01, TapponnierM76} and by distributed seismicity across the region (Fig. \ref{fig1_2
whereas I am looking for
BilhamE01, TapponnierM76
Another example with more than one \citep pattern gives output like this:
Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study
whereas I am looking for
Pauletal2015 Mitraetal2005
Can anyone please help?
It's a greedy match. Change the regex to match up to the first closing brace:
.*citep{\([^}]*\)}
Test:
$ echo "\citep{string} xyz {abc}" | sed 's/.*citep{\([^}]*\)}.*/\1/'
string
Note that it will only match one instance per line.
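The one-match-per-line limitation is easy to see with a line containing two \citep groups (the names A1 and B2 are placeholders):

```shell
# The greedy leading .* swallows everything up to the LAST citep{,
# so only the final group's contents survive.
printf '%s\n' '\citep{A1} mid \citep{B2} end' |
  sed 's/.*citep{\([^}]*\)}.*/\1/'
# B2
```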
If you are using grep anyway, you can as well stick with it (assuming GNU grep):
$ echo "$str" | grep -oP '(?<=\\citep{)[^}]+(?=})'
BilhamE01, TapponierM76
For what it's worth, this can be done with sed:
echo "\citep{string} xyz {abc} \citep{string2},foo" | \
sed 's/\\citep{\([^}]*\)}/\n\1\n\n/g; s/^[^\n]*\n//; s/\n\n[^\n]*\n/, /g; s/\n.*//g'
output:
string, string2
But wow, is that ugly. The sed script is more easily understood in this form, which happens to be suitable to be fed to sed via a -f argument:
# change every \citep{string} to <newline>string<newline><newline>
s/\\citep{\([^}]*\)}/\n\1\n\n/g
# remove any leading text before the first wanted string
s/^[^\n]*\n//
# replace text between wanted strings with comma + space
s/\n\n[^\n]*\n/, /g
# remove any trailing unwanted text
s/\n.*//
This makes use of the fact that sed can match and sub the newline character, even though reading a new line of input will not result in a newline initially appearing in the pattern space. The newline is the one character that we can be certain will appear in the pattern space (or in the hold space) only if sed puts it there intentionally.
The initial substitution is purely to make the problem manageable by simplifying the target delimiters. In principle, the remaining steps could be performed without that simplification, but the regular expressions involved would be horrendous.
This does assume that the string in every \citep{string} contains at least one character; if the empty string must be accommodated, too, then this approach needs a bit more refinement.
Of course, I can't imagine why anyone would prefer this to @Lev's straight grep approach, but the question does ask specifically for a sed solution.
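For a concrete run, here is the same script applied to a made-up line with two \citep groups (the names A1, B2, C3 are placeholders; GNU sed is assumed, since \n in the replacement text is a GNU extension):

```shell
# Each \citep{...} body is isolated between newlines, then the leading,
# trailing and in-between text is stripped or turned into ", ".
printf '%s\n' 'text \citep{A1, B2} middle \citep{C3} end' |
  sed 's/\\citep{\([^}]*\)}/\n\1\n\n/g; s/^[^\n]*\n//; s/\n\n[^\n]*\n/, /g; s/\n.*//'
# A1, B2, C3
```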
f.awk
BEGIN {
pat = "\\citep"
latex_tok = "\\\\[A-Za-z_][A-Za-z_]*" # match \aBcD
}
{
f = f $0 # store content of input file as a string
}
function store(args, n, k, i) { # store `keys' in `d'
gsub("[ \t]", "", args) # remove spaces
n = split(args, keys, ",")
for (i=1; i<=n; i++) {
k = keys[i]
d[k]
}
}
function ntok() { # next token
if (match(f, latex_tok)) {
tok = substr(f, RSTART ,RLENGTH)
f = substr(f, RSTART+RLENGTH-1 )
return 1
}
return 0
}
function parse( i, rc, args) {
for (;;) { # infinite loop
while ( (rc = ntok()) && tok != pat ) ;
if (!rc) return
i = index(f, "{")
if (!i) return # see `pat' but no '{'
f = substr(f, i+1)
i = index(f, "}")
if (!i) return # unmatched '}'
# extract `args' from \citep{`args'}
args = substr(f, 1, i-1)
store(args)
}
}
END {
parse()
for (k in d)
print k
}
f.example
of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}.
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography,
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of
Usage:
awk -f f.awk f.example
Expected output:
BendickB2001
Arunetal2010
Pauletal2015
Mitraetal2005
BilhamE01
Mukuletal2009
TapponnierM76
WangLiu2009
BhattacharyaM2009
Mitraetal2010
Actonetal2011
Vernantetal2014

How to print comments in lex?

So the title might be a little bit misleading, but I can't think of any better way to phrase it.
Basically, I'm writing a lexical scanner using cygwin/lex. A part of the code reads a token /*. It then goes into a predefined state C_COMMENT, and ends when it matches */ in that state. Below is the actual code:
"/*" {BEGIN(C_COMMENT); printf("%d: /*", linenum++);}
<C_COMMENT>"*/" { BEGIN(INITIAL); printf("*/\n"); }
<C_COMMENT>. {printf("%s",yytext);}
The code works when the comment is in a single line, such as
/* * Example of comment */
It will print the current line number, with the comment behind. But it doesn't work if the comment spans multiple lines. Rewriting the 3rd line into
<C_COMMENT>. {printf("%s",yytext);
printf("\n");}
doesn't work. It results in \n being printed for every letter in the comment. I'm guessing it has something to do with C having no string type, or maybe I'm using the states wrong.
Hope someone will be able to help me out :)
Also if there's any other info you need, just ask, and I'll provide.
The easiest way to echo the token scanned by a pattern is to use the special action ECHO:
"/*" { printf("%d: ", linenum++); ECHO; BEGIN(C_COMMENT); }
<C_COMMENT>"*/" { ECHO; BEGIN(INITIAL); }
<C_COMMENT>. { ECHO; }
None of the above rules matches a newline inside a comment, because in (f)lex . doesn't match newlines:
<C_COMMENT>\n { linenum++; ECHO; }
A faster way of recognizing C comments is with a single regular expression, although it's a little hard to read:
[/][*][^*]*[*]+([^/*][^*]*[*]+)*[/]
In this case, you'll have to rescan the comment to count newlines, unless you get flex to do the line number counting.
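This kind of regex can be sanity-checked from the shell on single-line comments (grep works line by line, so multi-line comments won't match here), using the equivalent POSIX ERE form with grep's -o:

```shell
# -o prints each match on its own line; the ERE mirrors the (f)lex
# pattern: "/*", a body that never contains "*/", then the closing "*/".
printf '%s\n' 'int x; /* one */ y; /* two * three */' |
  grep -oE '/\*[^*]*\*+([^/*][^*]*\*+)*/'
# /* one */
# /* two * three */
```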
flex scanners maintain a line number count in yylineno, if you request that feature (using %option yylineno). It's often more efficient and always more reliable than keeping the count yourself. However, in the action, the value of yylineno is the line number count at the end of the pattern, not at the beginning, which can be misleading for multiline patterns. A common workaround is to save the value of yylineno in another variable at the beginning of the token scan.

Ruby parslet: parsing multiple lines

I'm looking for a way to match multiple lines in Parslet.
The code looks like this:
rule(:line) { (match('$').absent? >> any).repeat >> match('$') }
rule(:lines) { line.repeat }
However, lines will always end up in an infinite loop which is because match('$') will endlessly repeat to match end of string.
Is it possible to match multiple lines that can be empty?
irb(main)> lines.parse($stdin.read)
This
is
a
multiline
string^D
should match successfully. Am I missing something? I also tried (match('$').absent? >> any.maybe).repeat(1) >> match('$') but that doesn't match empty lines.
Regards,
Danyel.
I usually define a rule for end_of_line. This is based on the trick in http://kschiess.github.io/parslet/tricks.html for matching end_of_file.
class MyParser < Parslet::Parser
rule(:cr) { str("\n") }
rule(:eol?) { any.absent? | cr }
rule(:line_body) { (eol?.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.repeat(0) }
root(:lines?)
end
puts MyParser.new.parse(""" this is a line
so is this
that was too
This ends""").inspect
Obviously, if you want to do more with the parser than you can achieve with String#split("\n"), you will replace the line_body with something useful :)
I had a quick go at answering this question and mucked it up. I just though I would explain the mistake I made, and show you how to avoid mistakes of that kind.
Here is my first answer.
rule(:eol) { str("\n") | any.absent? }
rule(:line) { (eol.absent? >> any).repeat >> eol }
rule(:lines) { line.as(:line).repeat }
I didn't follow my usual rules:
Always make repeat count explicit
Any rule that can match zero length strings, should have name ending in a '?'
So let's apply these...
rule(:eol?) { str("\n") | any.absent? }
# as the second option consumes nothing
rule(:line?) { (eol?.absent? >> any).repeat(0) >> eol? }
# repeat(0) can consume nothing
rule(:lines?) { line.as(:line?).repeat(0) }
# We have a problem! We have a rule that can consume nothing inside a `repeat`!
Here we see why we get an infinite loop. As the input is consumed, you end up with just the end of file, which matches eol? and hence line? (as the line body can be empty). Being inside the repeat in lines?, it keeps matching without consuming anything and loops forever.
We need to change the line rule so it always consumes something.
rule(:cr) { str("\n") }
rule(:eol?) { cr | any.absent? }
rule(:line_body) { (eol?.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.as(:line).repeat(0) }
Now line has to match something, either a cr (for empty lines), or at least one character followed by the optional eol?. All repeats have bodies that consume something. We are now golden.
I think you have two, related, problems with your matching:
The pseudo-character match $ does not consume any real characters. You still need to consume the newlines somehow.
Parslet is munging the input in some way, making $ match in places you might not expect. The best result I could get using $ ended up matching each individual character.
Much safer to use \n as the end-of-line character. I got the following to work (I am somewhat of a beginner with Parslet myself, so apologies if it could be clearer):
require 'parslet'
class Lines < Parslet::Parser
rule(:text) { match("[^\n]") }
rule(:line) { ( text.repeat(0) >> match("\n") ) | text.repeat(1) }
rule(:lines) { line.as(:line).repeat }
root :lines
end
s = "This
is
a
multiline
string"
p Lines.new.parse( s )
The rule for the line is complex because of the need to match empty lines and a possible final line without a \n.
You don't have to use the .as(:line) syntax - I just added it to show clearly that the :line rule is matching each line individually, and not simply consuming the whole input.
